I have a csv file that consists of data in this format:
id, name, surname, morecolumns
5, John, Lok, more
2, John2, Lok2, more
1, John3, Lok3, more
etc..
I want to sort my csv file using the id as key and store the sorted results in another file.
Here is what I've done so far in order to create JavaPairs of (id, rest_of_line):
SparkConf conf = new SparkConf().setAppName.....;
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> file = sc.textFile("inputfile.csv");
// extract the header
String header = file.first();
JavaRDD<String> lines = file.filter(s -> !s.equals(header));
// create JavaPairs
JavaPairRDD<Integer, String> pairRdd = lines.mapToPair(
    new PairFunction<String, Integer, String>() {
        public Tuple2<Integer, String> call(final String line) {
            // split only on the first comma: [0] is the id, [1] is the rest of the line
            String[] parts = line.split(",", 2);
            int id = Integer.parseInt(parts[0]);
            return new Tuple2<>(id, parts[1]);
        }
    });
// sort by id and save the output
pairRdd.sortByKey(true, 1).coalesce(1).saveAsTextFile("sorted.csv");
This works for small files. However, when I use bigger files, the output is not sorted properly. I think this happens because the sorting takes place on different nodes, so merging the results from all the nodes doesn't give the expected output.
So the question is: how can I sort my csv file using the id as key and store the sorted results in another file?
The coalesce method is probably the one to blame, as it does not contractually guarantee the ordering of the resulting RDD (see Which operations preserve RDD order?). So if you avoid the coalesce, the resulting output files will be ordered.
Since you want a single csv file, you could fetch the result files from whatever file system you're using, taking care to preserve their order, and merge them. For example, if you're using HDFS (as stated by @PinoSan) this can be done with the command hdfs dfs -getmerge <hdfs-output-dir> <local-file.csv>.
As pointed out by @mauriciojost, you should not do the coalesce.
Instead, a better way to do this is pairRdd.sortByKey(true, pairRdd.getNumPartitions()).saveAsTextFile(path), so that the maximum possible work is carried out on the partitions that hold the data.
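Putting the two answers together, a minimal sketch (assuming Java 8 lambdas; the HDFS output path below is just a placeholder) could look like this:
// Sort across all existing partitions; every part file is internally sorted
// and the part files are ordered relative to each other.
JavaPairRDD<Integer, String> sorted =
        pairRdd.sortByKey(true, pairRdd.getNumPartitions());

// Turn the pairs back into CSV lines before saving (no coalesce here).
sorted.map(t -> t._1() + "," + t._2())
      .saveAsTextFile("hdfs:///tmp/sorted-output");
The ordered part files can then be merged into a single local csv, preserving their order, with hdfs dfs -getmerge /tmp/sorted-output sorted.csv.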
I'm currently writing a program that takes in two CSVs - one containing database keys (and other information irrelevant to the current issue), the other being an asset manifest. The program checks the database key from the first CSV, queries an online database to retrieve the asset key, then gets the asset status from the second CSV. (This is a workaround to a stupid API issue.)
My problem is that while the CSV that is being iterated over is relatively short - only about 300 lines long usually - the other is an asset manifest that is easily 10000 lines long (and sorted, though not by the key I can obtain from the first CSV). I obviously don't want to iterate over the entire asset manifest for every single input line, since that will take roughly 10 eternities.
I'm a fairly inexperienced programmer, so I only know of sorting/searching algorithms, and I definitely don't know what would be the one to use for this. What algorithm would be the most efficient? Is there a way to "batch-query" the manifest for all of the assets listed in the input CSV that would be faster than searching the manifest individually for each key? Or should I use a tree or hashtable or something else I heard mentioned in other SE threads? I don't know anything about the performance implications of any of these.
I can format the manifest as needed when it's input (it's just copy-pasted into a GUI), so I guess I could iterate over the entire manifest when it's input and make a hashtable of key:line pairs and then search that? Or I could turn it into a 2D array and just search the specified index? Those are all I can think of.
Problem is, I don't know how much time computer operations like that take, and if that would just waste time or actually improve performance.
P.S. I'm currently using Java since it's all I know, but if another language would be faster, I'm all ears.
The simple solution is to create a HashMap: iterate through one of the files and add each of its lines to the HashMap (with the corresponding key and value), then iterate through the other file and check whether the HashMap contains the key; if it does, put the data into another HashMap, and return that second HashMap after the iteration.
Imagine we have a test1.csv file with columns key,name,family and content like below:
5000,ehsan,tashkhisi
2,ali,lllll
3,amel,lllll
1,azio,skkk
And a test2.csv file with columns key,status like below:
1000,status1
1,status2
5000,status3
4000,status4
4001,status1
4002,status3
5,status1
We want to have output like this:
1 -> status2
5000 -> status3
Simple code will be like below:
Java 8 Stream:
private static Map<String, String> findDataInTwoFilesJava8() throws IOException {
    // key -> status from test2.csv
    Map<String, String> map =
            Files.lines(Paths.get("/tmp/test2.csv")).map(a -> a.split(","))
                    .collect(Collectors.toMap(a -> a[0], a -> a[1]));
    // keep only the keys that also appear in test1.csv
    return Files.lines(Paths.get("/tmp/test1.csv")).map(a -> a.split(","))
            .filter(a -> map.containsKey(a[0]))
            .collect(Collectors.toMap(a -> a[0], a -> map.get(a[0])));
}
Simple Java:
private static Map<String, String> findDataInTwoFiles() throws IOException {
    String line;
    // key -> status from test2.csv
    Map<String, String> map = new HashMap<>();
    BufferedReader br = new BufferedReader(new FileReader("/tmp/test2.csv"));
    while ((line = br.readLine()) != null) {
        String[] lineData = line.split(",");
        map.put(lineData[0], lineData[1]);
    }
    br.close();
    // keep only the keys that also appear in test1.csv
    Map<String, String> resultMap = new HashMap<>();
    br = new BufferedReader(new FileReader("/tmp/test1.csv"));
    while ((line = br.readLine()) != null) {
        String key = line.split(",")[0];
        if (map.containsKey(key))
            resultMap.put(key, map.get(key));
    }
    br.close();
    return resultMap;
}
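For a quick check, either method can be called from a small main and the result printed (a sketch, assuming the two sample files exist under /tmp as above):
public static void main(String[] args) throws IOException {
    // Expected output for the sample files: 1 -> status2 and 5000 -> status3
    findDataInTwoFilesJava8()
            .forEach((key, status) -> System.out.println(key + " -> " + status));
}
Both variants read each file once and do constant-time lookups per key, so they avoid rescanning the whole manifest for every input line.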
I will be doing the following on a much bigger file. For now, I have an example input file with the following values.
1000,SMITH,JERRY
1001,JOHN,TIA
1002,TWAIN,MARK
1003,HARDY,DENNIS
1004,CHILD,JACK
1005,CHILD,NORTON
1006,DAVIS,JENNY
1007,DAVIS,KAREN
1008,MIKE,JOHN
1009,DENNIS,SHERIN
Now what I am doing is running a MapReduce job to encrypt the last name of each record and write the output back, using the mapper partition number as the key and the modified text as the value.
So the output from the mapper will be:
0 1000,Mj4oJyk=,JERRY
0 1001,KzwpPQ==,TIA
0 1002,NSQgOi8=,MARK
0 1003,KTIzNzg=,DENNIS
0 1004,IjsoPyU=,JACK
0 1005,IjsoPyU=,NORTON
0 1006,JTI3OjI=,JENNY
0 1007,JTI3OjI=,KAREN
0 1008,LDoqNg==,JOHN
0 1009,JTYvPSgg,SHERIN
I don't want any sorting to be done. I use a reducer because, with a larger file, there will be multiple mappers, and without a reducer multiple output files would be written; so I use a single reducer to merge the values from all mappers and write a single file.
Now the input values arrive at the reducer in reverse order, not in the order the mapper emitted them. It looks like the following:
1009,JTYvPSgg,SHERIN
1008,LDoqNg==,JOHN
1007,JTI3OjI=,KAREN
1006,JTI3OjI=,JENNY
1005,IjsoPyU=,NORTON
1004,IjsoPyU=,JACK
1003,KTIzNzg=,DENNIS
1002,NSQgOi8=,MARK
1001,KzwpPQ==,TIA
1000,Mj4oJyk=,JERRY
Why is it reversing the order, and how can I maintain the same order as the mapper output? Any suggestions will be helpful.
EDIT 1:
The driver code is:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJobName("encrypt");
job.setJarByClass(TestDriver.class);
job.setMapperClass(TestMap.class);
job.setNumReduceTasks(1);
job.setReducerClass(TestReduce.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(hdfsInputPath));
FileOutputFormat.setOutputPath(job, new Path(hdfsOutputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);
The mapper code is:
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();
// the mask(inputValues) method encrypts the input values and writes them to stringBuilder in the appropriate format
mask(inputValues);
context.write(new IntWritable(partition), new Text(stringBuilder.toString()));
The reducer code is:
for (Text value : values) {
    context.write(new Text(value), null);
}
The basic idea of MapReduce is that the order in which things are done is irrelevant.
So you cannot (and do not need to) control the order in which:
- the input records go through the mappers,
- the keys and related values go through the reducers.
The only thing you can control is the order in which the values are placed in the iterator that is made available in the reducer.
For that you can use the Object key to maintain the order of the values.
The LongWritable part (the key) is the position of the line in the file (not the line number, but the byte offset from the start of the file).
You can use that part to keep track of which line came first.
Then your mapper code will be changed to
protected void map(Object key, Text value, Mapper<Object, Text, LongWritable, Text>.Context context)
throws IOException, InterruptedException {
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
mask(inputValues);
// the mask(inputvalue) method is called to encrypt input values and write to stringbuilder in appropriate format
context.write(new LongWritable(((LongWritable) key).get()), value);
}
Note: you can change all IntWritable occurrences to LongWritable in your code, but be careful to keep the driver configuration in sync.
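For example, the map-output key class set in the driver from the question would then become (a short sketch):
// the map output key is now the byte offset of the input line, not an int partition id
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);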
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();  // no longer used as the key
// preserve the numeric id for sorting
IntWritable idNumber = new IntWritable(Integer.parseInt(inputValues[0]));
// the mask(inputValues) method encrypts the input values and writes them to stringBuilder in the appropriate format
mask(inputValues);
context.write(idNumber, new Text(stringBuilder.toString()));
I made some assumptions because you did not post the full code of the mapper. I assumed that inputValues is a String array because of the toString().split(",") call. The first element of the array should be the numeric value from your input, but it is a String at that point, so you must convert it back to an IntWritable to match what your mapper emits (IntWritable, Text). The Hadoop framework sorts by key, and with a key of type IntWritable it sorts in ascending order. The code you provided uses the task ID, and from reading the API https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/TaskAttemptID.html#getTaskID() it was unclear whether that would give your values the order you want. To control the order of the output I would recommend using the first element of your string array, converted to an IntWritable. I don't know if this conflicts with your intent to mask the inputValues.
EDIT
To follow up on your comment: you can simply multiply the partition by -1; this will cause the Hadoop framework to reverse the order.
int partition = -1*taskId.getId();
I have two files as input. The first file has a roll number and the subject1 mark, and the second file has a roll number and the subject2 mark. The first file comes in via Spark Streaming and the second file is in my HDFS. How can I split each file into key/value pairs, then extract the value and store it in a variable as an integer using Java in Spark? I tried, but I'm having difficulty extracting and storing it as an integer in a variable using JavaPairRDD. Thanks in advance for the help.
JavaRDD<String> sub1MarksRDD = sc.textFile("/user/ubuntu/sub1Marks.dat");
List<String> ccList = new ArrayList<String>();
ccList = sub1MarksRDD.collect();
JavaRDD<String> sub2MarksRDD = sc.textFile("/user/ubuntu/sub2marks.dat");
JavaPairRDD<String, Integer> result = sub1MarksRDD.mapToPair(
        new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String w) {
                return new Tuple2<String, Integer>(w, 1);
            }
        }
);
How should I go about creating a pair RDD that maps the roll no and marks1 from sub1Marks.dat to the data in sub2Marks.dat? And how do I extract the marks fields based on the roll no and store them in a variable?
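A minimal sketch of one way to do this, assuming each line in both files is just "rollNo,mark" (the roll number "101" below is made up), is to build a pair RDD per file and join them by roll number; in the streaming case the same pairing and join would be applied to each batch RDD:
JavaPairRDD<String, Integer> sub1Pairs = sub1MarksRDD.mapToPair(line -> {
    String[] parts = line.split(",");
    return new Tuple2<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
});

JavaPairRDD<String, Integer> sub2Pairs = sub2MarksRDD.mapToPair(line -> {
    String[] parts = line.split(",");
    return new Tuple2<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
});

// Join by roll number: each value is a (subject1 mark, subject2 mark) tuple.
JavaPairRDD<String, Tuple2<Integer, Integer>> joined = sub1Pairs.join(sub2Pairs);

// Extract one student's marks into plain int variables.
Tuple2<Integer, Integer> marks = joined.lookup("101").get(0);
int sub1Mark = marks._1();
int sub2Mark = marks._2();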
I want to copy data from one HBase table to another using the Java API, but I am not able to find one. Is there any Java API to do this?
Thanks.
The following is by no means the most optimized way, but from the tone of the question it seems performance is not the critical factor here.
First, you need to set up your HBaseConfiguration and your input / output tables:
Configuration config = HBaseConfiguration.create();
HTable inputTable = new HTable(config, "input_table");
HTable outputTable = new HTable(config, "output_table");
What you want is a "Scan", which allows a range scan to be performed. You need to define the query parameters, by adding columns to a Scan object.
Scan scan = new Scan(Bytes.toBytes("smith-"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
scan.addColumn(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"));
scan.setFilter(new PageFilter(25));
Now you are ready to invoke the scan object and retrieve results:
ResultScanner scanner = inputTable.getScanner(scan);
for (Result result : scanner) {
    putToOutputTable(result);
}
Now, to save to the second table, you can either do Puts within the for loop, or aggregate the results into a List for a bulk put (see the sketch after the method below).
protected void putToOutputTable(Result result) throws IOException {
    // Reuse the source row key; one Put per row.
    Put p = new Put(result.getRow());
    // Retrieve the map of families to their most recent qualifiers and values.
    NavigableMap<byte[], NavigableMap<byte[], byte[]>> map = result.getNoVersionMap();
    // Iterate through the family/qualifier/value entries for this result.
    // The column family must already exist in the output table schema; the
    // qualifier can be anything. Everything is specified as byte arrays,
    // as HBase is all about byte arrays.
    for (Map.Entry<byte[], NavigableMap<byte[], byte[]>> family : map.entrySet()) {
        for (Map.Entry<byte[], byte[]> column : family.getValue().entrySet()) {
            p.add(family.getKey(), column.getKey(), column.getValue());
        }
    }
    outputTable.put(p);
}
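If you prefer the bulk variant mentioned above, one sketch (assuming a client where Result.rawCells() and Put.add(Cell) are available) is to collect the Puts and write them in a single call:
List<Put> puts = new ArrayList<>();
for (Result result : scanner) {
    Put p = new Put(result.getRow());
    for (Cell cell : result.rawCells()) {
        p.add(cell);    // copy every cell of the row unchanged
    }
    puts.add(p);
}
outputTable.put(puts);  // write the whole batch at once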
If instead you want a more scalable version, take a look at how to use MapReduce to read from input HDFS files and write to output HBase tables here: Hbase Map/Reduce
I have inputs from two sources:
First map output, in the form:
output.collect(new StockKey(new Text(x + " " + id), new Text(id2)), new Text(data));
Second map output, in the form:
output.collect(new StockKey(new Text(x + " " + id), new Text("1")), new Text(data));
Job conf:
conf.setPartitionerClass(CustomPartitioner.class);
conf.setOutputValueGroupingComparator(StockKeyGroupingComparator.class);
where StockKey is a custom key class holding two Text fields (symbol and timestamp).
Constructor:
public StockKey() {
    this.symbol = new Text();
    this.timestamp = new Text();
}
Grouping comparator:
public class StockKeyGroupingComparator extends WritableComparator {

    protected StockKeyGroupingComparator() {
        super(StockKey.class, true);
    }

    public int compare(WritableComparable w1, WritableComparable w2) {
        StockKey k1 = (StockKey) w1;
        StockKey k2 = (StockKey) w2;
        Text x1 = new Text(k1.getSymbol());
        Text x2 = new Text(k2.getSymbol());
        return x1.compareTo(x2);
    }
}
But I'm not receiving the map output values from both inputs: only one map output's values reach the reducer. I want the records which have the same symbol, viz. new Text(x + " " + id), which is common to both map outputs, to be grouped to the same reducer. I am stuck here.
Please help!
To do this you need a Partitioner which fits in as follows:
- Your mappers output a bunch of records as key/value pairs.
- For each record, the partitioner is passed the key, the value and the number of reducers. The partitioner decides which reducer will handle the record.
- The records are shipped off to their respective partitions (reducers).
- The GroupingComparator is run to decide which key/value pairs get grouped into an iterable for a single call to the reducer() method.
- and so on...
I think the default partitioner is choosing the reducer partition for each record based on the entire value of your key (that's the default behavior). But you want records grouped by only part of the key (just the symbol, not the symbol and timestamp). So you need to write a partitioner that does this and configure it in the driver class.
Once you do that, your grouping comparator should help group the records as you intended.
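A minimal sketch of such a partitioner, written against the old mapred API used in the question and assuming StockKey exposes the symbol via getSymbol() (as in the grouping comparator above):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class CustomPartitioner implements Partitioner<StockKey, Text> {

    @Override
    public void configure(JobConf job) {
        // no configuration needed
    }

    @Override
    public int getPartition(StockKey key, Text value, int numPartitions) {
        // Partition on the symbol only, so records for the same symbol
        // go to the same reducer regardless of their timestamp.
        return (key.getSymbol().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}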
EDIT: random thoughts
You might make things easier on yourself if you moved the timestamp to the value, making the key simple (just the symbol) and the value complex (timestamp and value). Then you wouldn't need a partitioner or a grouping comparator.
You didn't say either way, but you did use the MultipleInputs class, right? That's the only way to invoke two or more mappers for the same job.