Spark: load a CSV into a JavaPairRDD keyed by a value found in the row - Java

I want to load a CSV into a JavaPairRDD, using a value in each row as the key and the row itself as the value.
The CSV has lines like this:
a,1,1,2
b,1,1,2
a,2,2,3
b,2,2,3
I have a Java object that represents these rows:
public class FactData implements Serializable{
public String key;
public int m1;
public int m2;
public int m3;
}
I'm currently building the JavaPairRDD like this:
JavaRDD<FactData> lines = sc.textFile("test.csv").map(line -> FactData.fromFileLine(line));
JavaPairRDD<String, Iterable<FactData>> groupBy = lines.groupBy(row -> row.getId());
But I am wondering if there is a faster/better way to do this, something like:
JavaPairRDD<String,Iterable<FactData>> groupedLines = sc.textFile("test.csv").flatMapToPair(new PairFlatMapFunction<String, String, Iterable<FactData>>() {
@Override
public Iterator<Tuple2<String, Iterable<FactData>>> call(String s) throws Exception {
//WHAT GOES IN HERE?
return null;
}
});
Any ideas appreciated.

Why don't you use keyBy?
Let's say you want the first value of each line as the key and the whole line as the value.
Then you can simply do this:
JavaRDD<String> lines = context.textFile("test.csv");
JavaPairRDD<String, String> newLines = lines.keyBy(new Function<String,String>(){
@Override
public String call(String arg0) throws Exception {
return arg0.split(",")[0];
}
});
If you want to collect it as a map of key to fields, maybe you can do this:
JavaPairRDD<String, Iterable<String>> newLines = lines.keyBy(new Function<String,String>(){
@Override
public String call(String arg0) throws Exception {
return arg0.split(",")[0];
}
}).mapValues(new Function<String, Iterable<String>>(){
@Override
public Iterable<String> call(String arg0) throws Exception {
return Arrays.asList(arg0.split(","));
}
});
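If you want to keep the FactData objects grouped by key, a minimal sketch (assuming the FactData.fromFileLine factory and the getId() accessor from the question) is to map to pairs first and then group:
JavaPairRDD<String, Iterable<FactData>> grouped = sc.textFile("test.csv")
    .map(line -> FactData.fromFileLine(line))
    .mapToPair(fact -> new Tuple2<>(fact.getId(), fact))
    .groupByKey();
Both groupBy and groupByKey shuffle every row, so neither is dramatically faster; if you only need per-key aggregates rather than the full Iterable, reduceByKey or aggregateByKey would avoid materializing it.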

Related

Apache Flink: process XML and write it to a database

I have the following use case:
XML files are written to a Kafka topic, which I want to consume and process via Flink.
The XML attributes have to be renamed to match the database table columns. These renames have to be flexible and maintainable from outside the Flink job.
At the end, the attributes have to be written to the database.
Each XML document represents a database record.
As a second step, some attributes of all XML documents from the last x minutes have to be aggregated.
As far as I know, Flink is capable of all the mentioned steps, but I am lacking an idea of how to implement it correctly.
Currently I have implemented the Kafka source, retrieve the XML document, and parse it via a custom MapFunction. There I create a POJO and store each attribute name and value in a HashMap.
public class Data{
private Map<String,String> attributes = new HashMap<>();
}
The HashMap contains entries like:
Key: path.to.attribute.one, Value: value of attribute one
Now I would like to use broadcast state to change the original attribute names to the database column names.
At this stage I am stuck: I have my POJO data with the attributes inside the HashMap, but I don't know how to connect it with the mapping via broadcasting.
Another way would be to flatMap the XML document attributes into single records. This leaves me with two problems:
How to ensure that attributes from one document don't get mixed with those from another document within the stream
How to merge all the attributes of one document back together to insert them as one record into the database
For the second stage I am aware of the window functions, even if I haven't understood them in every detail, but I guess they would fit my requirement. The question at this stage would be whether I can use more than one sink in one job, one being a stream of the raw data and one of the aggregated data.
Can someone help with a hint?
Cheers
UPDATE
Here is what I got so far - I simplified the code; the XmlData POJO represents my parsed XML document.
public class StreamingJob {
static Logger LOG = LoggerFactory.getLogger(StreamingJob.class);
public static void main(String[] args) throws Exception {
// set up the streaming execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
XmlData xmlData1 = new XmlData();
xmlData1.addAttribute("path.to.attribute.eventName","Start");
xmlData1.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:00.000");
xmlData1.addAttribute("third.path.to.attribute.eventSource","Source1");
xmlData1.addAttribute("path.to.attribute.additionalAttribute","Lorem");
XmlData xmlData2 = new XmlData();
xmlData2.addAttribute("path.to.attribute.eventName","Start");
xmlData2.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:01.000");
xmlData2.addAttribute("third.path.to.attribute.eventSource","Source2");
xmlData2.addAttribute("path.to.attribute.additionalAttribute","First");
XmlData xmlData3 = new XmlData();
xmlData3.addAttribute("path.to.attribute.eventName","Start");
xmlData3.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:01.000");
xmlData3.addAttribute("third.path.to.attribute.eventSource","Source1");
xmlData3.addAttribute("path.to.attribute.additionalAttribute","Day");
Mapping mapping1 = new Mapping();
mapping1.addMapping("path.to.attribute.eventName","EVENT_NAME");
mapping1.addMapping("second.path.to.attribute.eventTimestamp","EVENT_TIMESTAMP");
DataStream<Mapping> mappingDataStream = env.fromElements(mapping1);
MapStateDescriptor<String, Mapping> mappingStateDescriptor = new MapStateDescriptor<String, Mapping>(
"MappingBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<Mapping>() {}));
BroadcastStream<Mapping> mappingBroadcastStream = mappingDataStream.broadcast(mappingStateDescriptor);
DataStream<XmlData> dataDataStream = env.fromElements(xmlData1, xmlData2, xmlData3);
//Convert the xml with all attributes to a stream of attribute names and values
DataStream<Tuple2<String, String>> recordDataStream = dataDataStream
.flatMap(new CustomFlatMapFunction());
//Map the attributes with the mapping information
DataStream<Tuple2<String,String>> outputDataStream = recordDataStream
.connect(mappingBroadcastStream)
.process();
env.execute("Process xml data and write it to database");
}
static class XmlData{
private Map<String,String> attributes = new HashMap<>();
public XmlData(){
}
public String toString(){
return this.attributes.toString();
}
public Map<String,String> getColumns(){
return this.attributes;
}
public void addAttribute(String key, String value){
this.attributes.put(key,value);
}
public String getAttributeValue(String attributeName){
return attributes.get(attributeName);
}
}
static class Mapping{
//First string is the attribute path and name
//Second string is the database column name
Map<String,String> mappingTuple = new HashMap<>();
public Mapping(){}
public void addMapping(String attributeNameWithPath, String databaseColumnName){
this.mappingTuple.put(attributeNameWithPath,databaseColumnName);
}
public Map<String, String> getMappingTuple() {
return mappingTuple;
}
public void setMappingTuple(Map<String, String> mappingTuple) {
this.mappingTuple = mappingTuple;
}
}
static class CustomFlatMapFunction implements FlatMapFunction<XmlData, Tuple2<String,String>> {
@Override
public void flatMap(XmlData xmlData, Collector<Tuple2< String,String>> collector) throws Exception {
for(Map.Entry<String,String> entrySet : xmlData.getColumns().entrySet()){
collector.collect(new Tuple2<>(entrySet.getKey(), entrySet.getValue()));
}
}
}
static class CustomBroadcastingFunction extends BroadcastProcessFunction {
@Override
public void processElement(Object o, ReadOnlyContext readOnlyContext, Collector collector) throws Exception {
}
@Override
public void processBroadcastElement(Object o, Context context, Collector collector) throws Exception {
}
}
}
Here's some example code of how to do this using a BroadcastStream. There's a subtle issue where the attribute remapping data might show up after one of the records. Normally you'd use a timer with state to hold onto any records that are missing remapping data, but in your case it's unclear whether a missing remapping is a "need to wait longer" or "no mapping exists". In any case, this should get you started...
private static MapStateDescriptor<String, String> REMAPPING_STATE = new MapStateDescriptor<>("remappings", String.class, String.class);
@Test
public void testUnkeyedStreamWithBroadcastStream() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(2);
List<Tuple2<String, String>> attributeRemapping = new ArrayList<>();
attributeRemapping.add(new Tuple2<>("one", "1"));
attributeRemapping.add(new Tuple2<>("two", "2"));
attributeRemapping.add(new Tuple2<>("three", "3"));
attributeRemapping.add(new Tuple2<>("four", "4"));
attributeRemapping.add(new Tuple2<>("five", "5"));
attributeRemapping.add(new Tuple2<>("six", "6"));
BroadcastStream<Tuple2<String, String>> attributes = env.fromCollection(attributeRemapping)
.broadcast(REMAPPING_STATE);
List<Map<String, Integer>> xmlData = new ArrayList<>();
xmlData.add(makePOJO("one", 10));
xmlData.add(makePOJO("two", 20));
xmlData.add(makePOJO("three", 30));
xmlData.add(makePOJO("four", 40));
xmlData.add(makePOJO("five", 50));
DataStream<Map<String, Integer>> records = env.fromCollection(xmlData);
records.connect(attributes)
.process(new MyRemappingFunction())
.print();
env.execute();
}
private Map<String, Integer> makePOJO(String key, int value) {
Map<String, Integer> result = new HashMap<>();
result.put(key, value);
return result;
}
@SuppressWarnings("serial")
private static class MyRemappingFunction extends BroadcastProcessFunction<Map<String, Integer>, Tuple2<String, String>, Map<String, Integer>> {
@Override
public void processBroadcastElement(Tuple2<String, String> in, Context ctx, Collector<Map<String, Integer>> out) throws Exception {
ctx.getBroadcastState(REMAPPING_STATE).put(in.f0, in.f1);
}
@Override
public void processElement(Map<String, Integer> in, ReadOnlyContext ctx, Collector<Map<String, Integer>> out) throws Exception {
final ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(REMAPPING_STATE);
Map<String, Integer> result = new HashMap<>();
for (String key : in.keySet()) {
if (state.contains(key)) {
result.put(state.get(key), in.get(key));
} else {
result.put(key, in.get(key));
}
}
out.collect(result);
}
}
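Mapped onto the XmlData and Mapping types from your update, the same pattern could look roughly like this; a sketch, assuming a MapStateDescriptor<String, String> that stores attribute-path-to-column-name entries (the sink at the end is just a placeholder):
static class XmlRemappingFunction extends BroadcastProcessFunction<XmlData, Mapping, XmlData> {
    static final MapStateDescriptor<String, String> MAPPING_STATE =
            new MapStateDescriptor<>("mappings", String.class, String.class);

    @Override
    public void processBroadcastElement(Mapping mapping, Context ctx, Collector<XmlData> out) throws Exception {
        // flatten the Mapping object into broadcast state: attribute path -> column name
        for (Map.Entry<String, String> e : mapping.getMappingTuple().entrySet()) {
            ctx.getBroadcastState(MAPPING_STATE).put(e.getKey(), e.getValue());
        }
    }

    @Override
    public void processElement(XmlData xmlData, ReadOnlyContext ctx, Collector<XmlData> out) throws Exception {
        ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(MAPPING_STATE);
        XmlData renamed = new XmlData();
        for (Map.Entry<String, String> e : xmlData.getColumns().entrySet()) {
            // fall back to the original attribute name if no mapping is known (yet)
            String column = state.contains(e.getKey()) ? state.get(e.getKey()) : e.getKey();
            renamed.addAttribute(column, e.getValue());
        }
        out.collect(renamed);
    }
}
Wiring it up would then be something like:
dataDataStream
    .connect(mappingDataStream.broadcast(XmlRemappingFunction.MAPPING_STATE))
    .process(new XmlRemappingFunction())
    .print();  // replace with your JDBC/database sink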

Spark UDF written in Java Lambda raises ClassCastException

Here's the exception:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to ... of type org.apache.spark.sql.api.java.UDF2 in instance of ...
If I don't implement the UDF with a lambda expression, it's OK, like:
private UDF2 funUdf = new UDF2<String, String, String>() {
@Override
public String call(String a, String b) throws Exception {
return fun(a, b);
}
};
dataset.sparkSession().udf().register("Fun", funUdf, DataTypes.StringType);
functions.callUDF("Fun", functions.col("a"), functions.col("b"));
I am running in local mode, so this answer will not help: https://stackoverflow.com/a/28367602/4164722
Why? How can I fix it?
This is a working solution:
UDF1 myUDF = new UDF1<String, String>() {
public String call(final String str) throws Exception {
return str+"A";
}
};
sparkSession.udf().register("Fun", myUDF, DataTypes.StringType);
Dataset<Row> rst = sparkSession.read().format("text").load("myFile");
rst.withColumn("nameA", functions.callUDF("Fun", functions.col("name"))).show();
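The registered UDF can also be called from Spark SQL directly; a small usage sketch (assuming Spark 2.x and the rst dataset with its name column from above):
rst.createOrReplaceTempView("t");
sparkSession.sql("SELECT name, Fun(name) AS nameA FROM t").show();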

Apache Flink 0.10: how to get the first occurrence of a composite key from an unbounded input DataStream?

I am a newbie with Apache Flink. I have an unbounded data stream as my input (fed into Flink 0.10 via Kafka).
I want to get the first occurrence of each primary key (the primary key is the contract_num and the event_dt).
These "duplicates" occur nearly immediately after each other.
The source system cannot filter this for me, so flink has to do it.
Here is my input data:
contract_num, event_dt, attr
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C
Here is the output data I want:
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C
Note the 2nd row has been removed, as the key combination of A1 and '2016-02-24 10:25:08' already occurred in the 1st row.
How can I do this with flink 0.10?
I was thinking about using keyBy(0,1) but after that I don't know what to do!
(I used joda-time and org.flinkspector to set up these tests).
@Test
public void test() {
DateTime threeSecondsAgo = (new DateTime()).minusSeconds(3);
DateTime twoSecondsAgo = (new DateTime()).minusSeconds(2);
DateTime oneSecondsAgo = (new DateTime()).minusSeconds(1);
DataStream<Tuple3<String, Date, String>> testStream =
createTimedTestStreamWith(
Tuple3.of("A1", threeSecondsAgo.toDate(), "X"))
.emit(Tuple3.of("A1", threeSecondsAgo.toDate(), "Y"), after(0, TimeUnit.NANOSECONDS))
.emit(Tuple3.of("A1", twoSecondsAgo.toDate(), "Z"), after(0, TimeUnit.NANOSECONDS))
.emit(Tuple3.of("A2", oneSecondsAgo.toDate(), "C"), after(0, TimeUnit.NANOSECONDS))
.close();
testStream.keyBy(0,1);
}
Filtering duplicates over an infinite stream will eventually fail if your key space is larger than your available storage space. The reason is that you have to store the already seen keys somewhere to filter out the duplicates. Thus, it would be good to define a time window after which you can purge the current set of seen keys.
If you're aware of this problem but want to try it anyway, you can do it by applying a stateful flatMap operation after the keyBy call. The stateful mapper uses Flink's state abstraction to store whether it has already seen an element with this key or not. That way, you will also benefit from Flink's fault tolerance mechanism because your state will be automatically checkpointed.
A Flink program doing your job could look like
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple3<String, Date, String>> input = env.fromElements(Tuple3.of("foo", new Date(1000), "bar"), Tuple3.of("foo", new Date(1000), "foobar"));
input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();
env.execute("Test");
}
where the implementation of DuplicateFilter depends on the version of Flink.
Version >= 1.0 implementation
public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {
static final ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class, false);
private ValueState<Boolean> operatorState;
@Override
public void open(Configuration configuration) {
operatorState = this.getRuntimeContext().getState(descriptor);
}
@Override
public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
if (!operatorState.value()) {
// we haven't seen the element yet
out.collect(value);
// set operator state to true so that we don't emit elements with this key again
operatorState.update(true);
}
}
}
Version 0.10 implementation
public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {
private OperatorState<Boolean> operatorState;
@Override
public void open(Configuration configuration) {
operatorState = this.getRuntimeContext().getKeyValueState("seen", Boolean.class, false);
}
@Override
public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
if (!operatorState.value()) {
// we haven't seen the element yet
out.collect(value);
operatorState.update(true);
}
}
}
Update: Using a tumbling time window
input.keyBy(0, 1).timeWindow(Time.seconds(1)).apply(new WindowFunction<Iterable<Tuple3<String,Date,String>>, Tuple3<String, Date, String>, Tuple, TimeWindow>() {
@Override
public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Date, String>> input, Collector<Tuple3<String, Date, String>> out) throws Exception {
out.collect(input.iterator().next());
}
})
Here's another way to do this that I happen to have just written. It has the disadvantage of being a bit more custom code, since it doesn't use the built-in Flink windowing functions, but it doesn't have the latency penalty that Till mentioned. Full example on GitHub.
package com.dataartisans.filters;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.checkpoint.CheckpointedAsynchronously;
import java.io.Serializable;
import java.util.HashSet;
import java.util.concurrent.TimeUnit;
/**
* This class filters duplicates that occur within a configurable time of each other in a data stream.
*/
public class DedupeFilterFunction<T, K extends Serializable> extends RichFilterFunction<T> implements CheckpointedAsynchronously<HashSet<K>> {
private LoadingCache<K, Boolean> dedupeCache;
private final KeySelector<T, K> keySelector;
private final long cacheExpirationTimeMs;
/**
* @param cacheExpirationTimeMs The expiration time for elements in the cache
*/
public DedupeFilterFunction(KeySelector<T, K> keySelector, long cacheExpirationTimeMs){
this.keySelector = keySelector;
this.cacheExpirationTimeMs = cacheExpirationTimeMs;
}
@Override
public void open(Configuration parameters) throws Exception {
createDedupeCache();
}
@Override
public boolean filter(T value) throws Exception {
K key = keySelector.getKey(value);
boolean seen = dedupeCache.get(key);
if (!seen) {
dedupeCache.put(key, true);
return true;
} else {
return false;
}
}
@Override
public HashSet<K> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
return new HashSet<>(dedupeCache.asMap().keySet());
}
@Override
public void restoreState(HashSet<K> state) throws Exception {
createDedupeCache();
for (K key : state) {
dedupeCache.put(key, true);
}
}
private void createDedupeCache() {
dedupeCache = CacheBuilder.newBuilder()
.expireAfterWrite(cacheExpirationTimeMs, TimeUnit.MILLISECONDS)
.build(new CacheLoader<K, Boolean>() {
@Override
public Boolean load(K k) throws Exception {
return false;
}
});
}
}
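Usage is then a plain filter call on the stream; a rough sketch for the tuple stream from the question (the composite key selector and the 5-minute expiry are just illustrative choices):
DataStream<Tuple3<String, Date, String>> deduped = testStream.filter(
    new DedupeFilterFunction<Tuple3<String, Date, String>, Tuple2<String, Date>>(
        new KeySelector<Tuple3<String, Date, String>, Tuple2<String, Date>>() {
            @Override
            public Tuple2<String, Date> getKey(Tuple3<String, Date, String> value) {
                // dedupe on the composite key (contract_num, event_dt)
                return new Tuple2<String, Date>(value.f0, value.f1);
            }
        },
        TimeUnit.MINUTES.toMillis(5)));
Note that an anonymous KeySelector declared inside a non-static method captures its enclosing instance, so that enclosing class has to be serializable as well.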

Save and Read Key-Value pair in Spark

I have a JavaPairRDD in the following format:
JavaPairRDD<String, Tuple2<String, List<String>>> myData;
I want to save it in a key-value format (String, Tuple2<String, List<String>>).
myData.saveAsXXXFile("output-path");
So my next job could read in the data directly to my JavaPairRDD:
JavaPairRDD<String, Tuple2<String, List<String>>> newData = context.XXXFile("output-path");
I am using Java 7, Spark 1.2, and the Java API. I tried saveAsTextFile and saveAsObjectFile; neither works. And I don't see a saveAsSequenceFile option in my Eclipse.
Does anyone have any suggestion for this problem?
Thank you very much!
You could use SequenceFileRDDFunctions, which is used through implicits in Scala; however, that might be nastier than the usual suggestion for Java of:
myData.saveAsHadoopFile(fileName, Text.class, CustomWritable.class,
SequenceFileOutputFormat.class);
implementing CustomWritable by implementing
org.apache.hadoop.io.Writable
Something like this should work (did not check for compilation):
public class MyWritable implements Writable {
    private String _1;
    private String[] _2;

    // Hadoop instantiates Writables reflectively, so a no-arg constructor is required
    public MyWritable() {
    }

    public MyWritable(Tuple2<String, String[]> data) {
        _1 = data._1;
        _2 = data._2;
    }

    public Tuple2<String, String[]> get() {
        return new Tuple2<>(_1, _2);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        _1 = WritableUtils.readString(in);
        ArrayWritable _2Writable = new ArrayWritable(Text.class);
        _2Writable.readFields(in);
        _2 = _2Writable.toStrings();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, _1);
        ArrayWritable _2Writable = new ArrayWritable(_2);
        _2Writable.write(out);
    }
}
such that it fits your data model.
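To read the pairs back in the next job, a minimal sketch could look like this (assuming the MyWritable above and the old mapred SequenceFileInputFormat; Hadoop reuses Writable instances, so copy the values out before caching):
JavaPairRDD<Text, MyWritable> raw = context.hadoopFile(
        "output-path", SequenceFileInputFormat.class, Text.class, MyWritable.class);
JavaPairRDD<String, Tuple2<String, String[]>> restored = raw.mapToPair(
        new PairFunction<Tuple2<Text, MyWritable>, String, Tuple2<String, String[]>>() {
            @Override
            public Tuple2<String, Tuple2<String, String[]>> call(Tuple2<Text, MyWritable> pair) {
                // copy out of the reused Writable objects into plain Java values
                return new Tuple2<String, Tuple2<String, String[]>>(pair._1().toString(), pair._2().get());
            }
        });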

Anonymous class does not have an argument

I am learning Apache Spark. Given the Spark word count implementation in Java below, I am confused about some of its details.
public class JavaWordCount {
public static void main(String[] args) throws Exception {
if (args.length < 2) {
System.err.println("Usage: JavaWordCount <master> <file>");
System.exit(1);
}
JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
System.getenv("SPARK_HOME"), System.getenv("SPARK_EXAMPLES_JAR"));
JavaRDD<String> lines = ctx.textFile(args[1], 1);
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
});
JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2 tuple : output) {
System.out.println(tuple._1 + ": " + tuple._2);
}
System.exit(0);
}
}
As I understand it, starting at the lines.flatMap(...) call, an anonymous FlatMapFunction class is passed into lines.flatMap() as an argument. But what does the String s mean? It seems that no created String s is passed as an argument, so how will the FlatMapFunction<String, String>(){} class work when no specific arguments are passed in?
The anonymous class instance you're passing is overriding the call(String s) method. Whatever is receiving this anonymous class instance is something that wants to make use of that call() method during its execution: it will be (somehow) constructing strings and passing them (directly or indirectly) to the call() method of whatever you've passed in.
So the fact that you're not invoking the method you've defined isn't a worry: something else is doing so.
This is a common use case for anonymous inner classes. A method m() expects to be passed something that implements the Blah interface, and the Blah interface has a frobnicate(String s) method in it. So we call it with
m(new Blah() {
public void frobnicate(String s) {
//exciting code goes here to do something with s
}
});
and the m method will now be able to take this instance that implements Blah, and invoke frobnicate() on it.
Perhaps m looks like this:
public void m(Blah b) {
b.frobnicate("whatever");
}
Now the frobnicate() method that we wrote in our inner class is being invoked, and as it runs, the parameter s will be set to "whatever".
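Putting the pieces together, a complete stand-alone version of this made-up example (same hypothetical Blah/frobnicate/m names as above) looks like:
interface Blah {
    void frobnicate(String s);
}

public class CallbackDemo {
    static void m(Blah b) {
        // m decides what string to pass to the callback
        b.frobnicate("whatever");
    }

    public static void main(String[] args) {
        m(new Blah() {
            @Override
            public void frobnicate(String s) {
                // s is "whatever" here, supplied by m(), not by the caller of m()
                System.out.println("got: " + s);
            }
        });
    }
}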
All you are doing here is passing a FlatMapFunction as an argument to the flatMap method; the FlatMapFunction you pass overrides call(String s):
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>()
{
public Iterable<String> call(String s)
{
return Arrays.asList(s.split(" "));
}
});
The code implementing lines.flatMap could look like this for instance:
public JavaRDD<String> flatMap(FlatMapFunction<String, String> map)
{
String str = "some string";
Iterable<String> it = map.call(str);
// do stuff with 'it'
// return a JavaRDD<String>
}
