I am using a map-only (zero-reduce) approach for my problem. I want to preprocess data from one file and then write it out as another file, but without the extra newlines and tab delimiters that the output format adds. How can I make my map job write the processed data in the same file format it came in, minus the parts removed by preprocessing?
That is, I have something like this:
Preprocess:
<TITLE> Herp derp </Title> I am a major general
Post Process:
Herp
Derp
I
am
a
major
general
What I want it to do is this:
Herp Derp I am a major general
I believe the issue is with this line of code:
job.setOutputFormatClass(TextOutputFormat.class);
However, when I tried, quite naively to do something like:
job.setOutputFormatClass(null);
It obviously would not work. Is there a format class provided that I can use to do this? If not, how could I write my own class to output everything the way I want? I am new to Hadoop and MapReduce.
I have included my map function below. I do not want to use a reducer, as the data would get sorted between the map and reduce phases.
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        // Did preprocessing here, irrelevant to my problem
        context.write(word, null);
    }
}
Also, I have googled this and read the Apache Hadoop API docs to see if I can glean an answer.
In your mapper, instead of parsing the line into individual words and writing each of them out, send the entire line to context.write(). That way the string you are originally working with stays together, instead of being sent out piece by piece.
So, cut your string apart for the preprocessing work, then put it back together before you send it out with the context.write call.
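A minimal sketch of that idea, reusing the question's mapper signature and its word field (the markup-skipping check is just a stand-in for whatever the real preprocessing does):
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    StringBuilder rebuilt = new StringBuilder();
    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        // Stand-in for the real preprocessing: drop markup tokens like <TITLE>
        if (token.startsWith("<") && token.endsWith(">")) {
            continue;
        }
        if (rebuilt.length() > 0) {
            rebuilt.append(' ');
        }
        rebuilt.append(token);
    }
    word.set(rebuilt.toString());
    context.write(word, null); // one output record per input line
}
With a map-only job (job.setNumReduceTasks(0)) and a null value, TextOutputFormat writes just the key, so each input line yields exactly one output line and no tab separator is appended.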
If your mapper is writing multiple records containing the individual tokens from a single input line, then you will absolutely need a reducer to group those tokens back together into a single line for output. You can't do this without a reducer.
Related
I have a CSV file of US population data for every county in the US. I need to get each population from the 8th column of the file. I'm using a FileReader and a BufferedReader and am not sure how to use the split method to accomplish this. I know this isn't much information, but I know that I'll be using args[0] as the file path in my class.
I'm at a loss as to where to begin, to be honest.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            BufferedReader buff = new BufferedReader(new FileReader(args[0]));
            String line;
            // stuck here: how do I split each line and total the 8th column?
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output should be an integer of the total US population. Any help with pointing me in the right direction would be great.
Don't reinvent the wheel, and don't parse CSV yourself: use a library. Even a format as simple as CSV has nuances: fields can be quoted or unquoted, the file may or may not have a header, and so on. Besides that, you have to test and maintain the code you've written. So writing less code and reusing libraries is good.
There are plenty of CSV libraries for Java:
Apache Commons CSV
OpenCSV
Super CSV
Univocity
flatpack
IMHO, the first two are the most popular.
Here is an example for Apache Commons CSV:
final Reader in = new FileReader("counties.csv");
final Iterable<CSVRecord> records = CSVFormat.DEFAULT.parse(in);
for (final CSVRecord record : records) { // Simply iterate over the records; all the parsing is handled for you
    String populationString = record.get(7); // Indexes are zero-based
    // Or, if your file has headers: record.get("population")
    … // Do whatever you want with the population
}
Look how easy it is! And it will be similar with other parsers.
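Applied to the question, a self-contained sketch might look like this (it assumes the file has no header row and that the 8th column is a plain integer; adjust as needed):
import java.io.FileReader;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class PopulationSum {
    public static void main(String[] args) throws Exception {
        long total = 0;
        try (Reader in = new FileReader(args[0])) {
            for (CSVRecord record : CSVFormat.DEFAULT.parse(in)) {
                // 8th column -> index 7 (zero-based)
                total += Long.parseLong(record.get(7).trim());
            }
        }
        System.out.println("Total US population: " + total);
    }
}
If the file does have a header row, CSVFormat.DEFAULT.withFirstRecordAsHeader() skips it and lets you fetch the column by name instead.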
Let's say I have a Stream with elements of type String. I want to write each element in the stream to a separate file in some folder. I'm using the following set up.
stream.writeAsText(path).setParallelism(1);
How do I make this path dynamic? I even tried appending System.nanoTime() to the path to make it dynamic, but it still doesn't seem to work; everything gets written to a single file.
This sort of use case is explicitly supported in Flink by the Rolling File Sink with a custom bucketer, or the newer and preferred Streaming File Sink with a custom BucketAssigner and RollingPolicy.
Your problem is that DataStream.writeAsText() with parallelism 1 sends the whole stream to a single file, so you will only ever get one file.
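As a rough sketch of the StreamingFileSink route (assuming Flink 1.6+; the bucket-per-element scheme below is only one illustrative choice, and it produces a sub-directory per bucket rather than literally one flat file per element):
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

StreamingFileSink<String> sink = StreamingFileSink
    .forRowFormat(new Path(path), new SimpleStringEncoder<String>("UTF-8"))
    .withBucketAssigner(new BucketAssigner<String, String>() {
        @Override
        public String getBucketId(String element, Context context) {
            // Route each element to its own bucket (sub-directory); pick whatever key makes sense
            return Integer.toHexString(element.hashCode());
        }

        @Override
        public SimpleVersionedSerializer<String> getSerializer() {
            return SimpleVersionedStringSerializer.INSTANCE;
        }
    })
    .build();

stream.addSink(sink).setParallelism(1);
Note that the StreamingFileSink only finalizes part files on checkpoints, so checkpointing needs to be enabled for the output to become visible.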
It looks like this will return a collection that you can use to output your strings as different files.
dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out)
            throws Exception {
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});
Taken straight from the documentation here: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html
I am trying to figure out the best way to use the univocity parser to handle a CSV log file with lines that look like the one below,
"23.62.3.74",80,"testUserName",147653,"Log Collection Device 100","31/02/15 00:05:10 GMT",-1,"10.37.255.3","TCP","destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3"
As you can see, this is a comma-delimited file, but the last column has a bunch of values prefixed with their field names. My requirement is to capture the values from the normal fields and, selectively, from this last big field.
I know about the master-detail row processor in univocity, but I doubt this fits into that category. Could you point me in the right direction, please?
Note: I can handle the name-prefixed fields in rowProcessed(String[] row, ParsingContext context) if I implement a row processor, but I am looking for something native to univocity if possible.
Thanks,
R
There's nothing native in the parser for that. Probably the easiest way to go about it is to have your RowProcessor as you mentioned.
One thing you can try to do to make your life easier is to use another instance of CsvParser to parse that last column:
//initialize a parser for the pipe separated bit
CsvParserSettings detailSettings = new CsvParserSettings();
detailSettings.getFormat().setDelimiter('=');
detailSettings.getFormat().setLineSeparator("|");
CsvParser detailParser = new CsvParser(detailSettings);
//here is the content of the last column (assuming you got it from the parser)
String details = "destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3";
//The result will be a list of pairs
List<String[]> pairs = detailParser.parseAll(new StringReader(details));
//You can add the pairs to a map
Map<String, String> map = new HashMap<String, String>();
for (String[] pair : pairs) {
    map.put(pair[0], pair[1]);
}
//this should print: {destination_ip=192.62.3.74, product_id=0071, proxy_machine_ip=10.1.255.3, option1_type=(s-dns)}
System.out.println(map);
That won't be extremely fast, but at least it's easy to work with a map if that input can have arbitrary column names and values associated with them.
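If you do go the RowProcessor route, a rough sketch of the wiring could look like this, building on the detailParser defined above (assuming univocity-parsers 2.x, where the processor is registered with setProcessor; the file name is illustrative, and rowProcessed is the callback mentioned in the question):
CsvParserSettings settings = new CsvParserSettings();
settings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // Normal fields come straight from `row`; the last column goes through the detail parser
        String details = row[row.length - 1];
        Map<String, String> detailMap = new HashMap<String, String>();
        for (String[] pair : detailParser.parseAll(new StringReader(details))) {
            detailMap.put(pair[0], pair[1]);
        }
        // e.g. use row[0], row[2] and detailMap.get("destination_ip") here
    }
});
new CsvParser(settings).parse(new FileReader("log.csv"));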
I am trying to write a program that takes a huge data set and then runs some queries on it using MapReduce. I have code like this:
public static class MRMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final String INDEX_PATH = "hdfs://master:9000/user/xxxx/indexln.txt";
    private Text word = new Text();
    private BufferedWriter out;
    private long max = 0;

    @Override
    protected void setup(Context context) throws IOException {
        // Open a side file on HDFS for the index entries
        FileSystem phdfs = FileSystem.get(context.getConfiguration());
        out = new BufferedWriter(new OutputStreamWriter(phdfs.create(new Path(INDEX_PATH), true)));
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Record the byte offset of this line in the index file
        String binln = Long.toBinaryString(0x8000000000000000L | key.get()).substring(1);
        out.write(binln + "\n");
        out.flush();

        String line = value.toString();
        String[] parts = line.split(",");
        long val = Math.abs(Long.parseLong(parts[2]));
        if (max < val) {
            max = val;
        } else {
            word.set(line);
            context.write(word, new LongWritable(val));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        out.close();
    }
}
What I am trying to do is build an index file in the mapper, which would be used by the mappers to access specific areas of the input file. The mappers read a part of the input file based on the index and then print the part read and the number of lines read to the output. I am using one mapper with 9 reducers.
My question is: is it possible to create/write to a file different from the output file in the map function, and can a reducer read a file that is open in the mapper? If yes, am I on the right path, totally wrong, or maybe MapReduce is not the way to do this? I apologize if this question sounds too noob, but I'm actually a noob in Hadoop and trying to learn. Thanks.
Are you sure you are using a single mapper? Hadoop creates a number of mappers very close to the number of input splits (more details).
The concept of an input split is very important as well: it means very big data files are split into several chunks, with each chunk assigned to a mapper. Thus, unless you are totally sure only one mapper is being used, you won't be able to control which part of the file you are working on, and you will not be able to maintain any kind of global index.
That being said, using a single mapper in MapReduce is the same as not using MapReduce at all :) Maybe the mistake is mine and I'm assuming you have only one file to be analyzed; is that the case?
In the case you have several big data files the scenario changes, and it could make sense to dedicate a single mapper to each file, but for that you will have to create your own InputFormat and override the isSplitable method so it always returns false.
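A minimal sketch of that (the class name is illustrative):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, so each file is handled by exactly one mapper
    }
}
You would then register it on the job with job.setInputFormatClass(WholeFileTextInputFormat.class);.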
Everything is in the title :)
I'm using org.json.CDL to convert a JSONArray into CSV data, but it renders a string with ',' as the separator.
I'd like to know if it's possible to use ';' instead.
Here is a simple example of what I'm doing:
public String exportAsCsv() throws Exception {
return CDL.toString(
new JSONArray(
mapper.writeValueAsString(extractAccounts()))
);
}
Thanks in advance for any advice on that question.
Edit: No string-replacement solution of course, as this could have an impact on large data; ideally the library used would let me specify the field separator.
Edit2: In the end, extracting the data as a JSONArray (and a String...) was not a very good approach, especially for a large data file.
So I made the following changes:
use a Java CSV library (for example: http://www.csvreader.com/java_csv_samples.php)
refactor the code to stream data from the JSON input source to the CSV output source
This works much better for large data. If you have comments, do not hesitate.
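For reference, a rough sketch of that kind of streaming export, here using Apache Commons CSV as an example library (the Account type, its getters, and the file name are illustrative, not from the original code), which lets you set ';' as the field separator directly:
import java.io.FileWriter;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

// Illustrative only: Account, getId() and getName() stand in for whatever extractAccounts() returns
try (CSVPrinter printer = new CSVPrinter(new FileWriter("accounts.csv"),
        CSVFormat.DEFAULT.withDelimiter(';').withHeader("id", "name"))) {
    for (Account account : extractAccounts()) {
        printer.printRecord(account.getId(), account.getName());
    }
}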
String output = "Hello,This,is,separated,by,a,comma";
// Simply call the replace method.
output = output.replace(',', ';');
I found this in the String documentation.
Example
String value = "Hello,this,is,a,string";
value = value.replace(',', ';');
System.out.println(value);
// Outputs: Hello;this;is;a;string