Mapping RDD with several comma separated fields in Spark

Mapping RDD with several comma separated fields in Spark - java

I am new to Spark and I am going over a tutorial where a line with several fields is parsed with Scala, the code with scala is like this:
val pass = lines.map(_.split(",")).
map(pass=>(pass(15),pass(7).toInt)).
reduceByKey(_+_)
where pass is data recevied from socketTextStream (its SparkStreams). I am new to Spark and want to use Java to have the same result. I have decalared JavaReceiverInputDStream using:
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
I came up with two possible solutions:
using flatMap:
JavaDStream<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
#Override public Iterable<String> call(String x) {
return Arrays.asList(x.split(","));
}
});
But it doesn't seem right since the result is breaking the CSV to words without any order.
Using map (compilation error), This looks like the appropriate solution but I am not able to extract the fields 15 and 7 using:
JavaDStream<List<String>> words = lines.map(
new Function<String, List<String>>() {
public List<String> call(String s) {
return Arrays.asList(s.split(","));
}
});
This idea fails when i try to map List<String> => Tuple2<String, Int>
The mapping code is:
JavaPairDStream<String, Integer> pairs = words.map(
new PairFunction<List<String>, String, Integer>() {
public Tuple2<String, Integer> call(List<String> s) throws Exception {
return new Tuple2(s.get(15), 6);
}
});
The error:
method map in class
org.apache.spark.streaming.api.java.AbstractJavaDStreamLike`<T,This,R>` cannot be applied to given types;
[ERROR] required: org.apache.spark.api.java.function.Function`<java.util.List`<java.lang.String>`,R>`
[ERROR] found: `<anonymous org.apache.spark.api.java.function.PairFunction`<java.util.List`<java.lang.String>`,java.lang.String,java.lang.Integer>`>`
[ERROR] reason: no instance(s) of type variable(s) R exist so that argument type `<anonymous org.apache.spark.api.java.function.PairFunction`<java.util.List`<java.lang.String>`,java.lang.String,java.lang.Integer>`>` conforms to formal parameter type org.apache.spark.api.java.
Any suggestions on this?

Use this code. It will get require field from String.
JavaDStream<String> lines = { ..... };
JavaPairDStream<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
#Override
public Tuple2<String, Integer> call(String t) throws Exception {
String[] words = t.split(",");
return new Tuple2<String, Integer>(words[15],Integer.parseInt(words[7]));
}
});

Related

Apache Flink Process xml and write them to database

i have the following use case.
Xml files are written to a kafka topic which i want to consume and process via flink.
The xml attributes have to be renamed to match the database table columns. These renames have to be flexible and maintainable from outside the flink job.
At the end the attributes have to be written to the database.
Each xml document repesent a database record.
As a second step all some attributes of all xml documents from the last x minutes have to be aggregated.
As i know so far flink is capable of all the mentioned steps but i am lacking of an idea how to implement it corretly.
Currently i have implemented the kafka source, retrieve the xml document and parse it via custom MapFunction. There i create a POJO and store each attribute name and value in a HashMap.
public class Data{
private Map<String,String> attributes = HashMap<>();
}
HashMap containing:
Key: path.to.attribute.one Value: Value of attribute one
Now i would like to use the Broadcasting State to change the original attribute names to the database column names.
At this stage i stuck as i have my POJO data with the attributes inside the HashMap but i don't know how to connect it with the mapping via Broadcasting.
Another way would be to flatMap the xml document attributes in single records. This leaves me with two problems:
How to assure that attributes from one document don't get mixed with them from another document within the stream
How to merge all the attributes of one document back to insert them as one record into the database
For the second stage i am aware of the Window function even if i don't have understood it in every detail but i guess it would fit my requirement. The question on this stage would be if i can use more than one sink in one job while one would be a stream of the raw data and one of the aggregated.
Can someone help with a hint?
Cheers
UPDATE
Here is what i got so far - i simplified the code the XmlData POJO is representing my parsed xml document.
public class StreamingJob {
static Logger LOG = LoggerFactory.getLogger(StreamingJob.class);
public static void main(String[] args) throws Exception {
// set up the streaming execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
XmlData xmlData1 = new XmlData();
xmlData1.addAttribute("path.to.attribute.eventName","Start");
xmlData1.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:00.000");
xmlData1.addAttribute("third.path.to.attribute.eventSource","Source1");
xmlData1.addAttribute("path.to.attribute.additionalAttribute","Lorem");
XmlData xmlData2 = new XmlData();
xmlData2.addAttribute("path.to.attribute.eventName","Start");
xmlData2.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:01.000");
xmlData2.addAttribute("third.path.to.attribute.eventSource","Source2");
xmlData2.addAttribute("path.to.attribute.additionalAttribute","First");
XmlData xmlData3 = new XmlData();
xmlData3.addAttribute("path.to.attribute.eventName","Start");
xmlData3.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:01.000");
xmlData3.addAttribute("third.path.to.attribute.eventSource","Source1");
xmlData3.addAttribute("path.to.attribute.additionalAttribute","Day");
Mapping mapping1 = new Mapping();
mapping1.addMapping("path.to.attribute.eventName","EVENT_NAME");
mapping1.addMapping("second.path.to.attribute.eventTimestamp","EVENT_TIMESTAMP");
DataStream<Mapping> mappingDataStream = env.fromElements(mapping1);
MapStateDescriptor<String, Mapping> mappingStateDescriptor = new MapStateDescriptor<String, Mapping>(
"MappingBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<Mapping>() {}));
BroadcastStream<Mapping> mappingBroadcastStream = mappingDataStream.broadcast(mappingStateDescriptor);
DataStream<XmlData> dataDataStream = env.fromElements(xmlData1, xmlData2, xmlData3);
//Convert the xml with all attributes to a stream of attribute names and values
DataStream<Tuple2<String, String>> recordDataStream = dataDataStream
.flatMap(new CustomFlatMapFunction());
//Map the attributes with the mapping information
DataStream<Tuple2<String,String>> outputDataStream = recordDataStream
.connect(mappingBroadcastStream)
.process();
env.execute("Process xml data and write it to database");
}
static class XmlData{
private Map<String,String> attributes = new HashMap<>();
public XmlData(){
}
public String toString(){
return this.attributes.toString();
}
public Map<String,String> getColumns(){
return this.attributes;
}
public void addAttribute(String key, String value){
this.attributes.put(key,value);
}
public String getAttributeValue(String attributeName){
return attributes.get(attributeName);
}
}
static class Mapping{
//First string is the attribute path and name
//Second string is the database column name
Map<String,String> mappingTuple = new HashMap<>();
public Mapping(){}
public void addMapping(String attributeNameWithPath, String databaseColumnName){
this.mappingTuple.put(attributeNameWithPath,databaseColumnName);
}
public Map<String, String> getMappingTuple() {
return mappingTuple;
}
public void setMappingTuple(Map<String, String> mappingTuple) {
this.mappingTuple = mappingTuple;
}
}
static class CustomFlatMapFunction implements FlatMapFunction<XmlData, Tuple2<String,String>> {
#Override
public void flatMap(XmlData xmlData, Collector<Tuple2< String,String>> collector) throws Exception {
for(Map.Entry<String,String> entrySet : xmlData.getColumns().entrySet()){
collector.collect(new Tuple2<>(entrySet.getKey(), entrySet.getValue()));
}
}
}
static class CustomBroadcastingFunction extends BroadcastProcessFunction {
#Override
public void processElement(Object o, ReadOnlyContext readOnlyContext, Collector collector) throws Exception {
}
#Override
public void processBroadcastElement(Object o, Context context, Collector collector) throws Exception {
}
}
}

Here's some example code of how to do this using a BroadcastStream. There's a subtle issue where the attribute remapping data might show up after one of the records. Normally you'd use a timer with state to hold onto any records that are missing remapping data, but in your case it's unclear whether a missing remapping is a "need to wait longer" or "no mapping exists". In any case, this should get you started...
private static MapStateDescriptor<String, String> REMAPPING_STATE = new MapStateDescriptor<>("remappings", String.class, String.class);
#Test
public void testUnkeyedStreamWithBroadcastStream() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(2);
List<Tuple2<String, String>> attributeRemapping = new ArrayList<>();
attributeRemapping.add(new Tuple2<>("one", "1"));
attributeRemapping.add(new Tuple2<>("two", "2"));
attributeRemapping.add(new Tuple2<>("three", "3"));
attributeRemapping.add(new Tuple2<>("four", "4"));
attributeRemapping.add(new Tuple2<>("five", "5"));
attributeRemapping.add(new Tuple2<>("six", "6"));
BroadcastStream<Tuple2<String, String>> attributes = env.fromCollection(attributeRemapping)
.broadcast(REMAPPING_STATE);
List<Map<String, Integer>> xmlData = new ArrayList<>();
xmlData.add(makePOJO("one", 10));
xmlData.add(makePOJO("two", 20));
xmlData.add(makePOJO("three", 30));
xmlData.add(makePOJO("four", 40));
xmlData.add(makePOJO("five", 50));
DataStream<Map<String, Integer>> records = env.fromCollection(xmlData);
records.connect(attributes)
.process(new MyRemappingFunction())
.print();
env.execute();
}
private Map<String, Integer> makePOJO(String key, int value) {
Map<String, Integer> result = new HashMap<>();
result.put(key, value);
return result;
}
#SuppressWarnings("serial")
private static class MyRemappingFunction extends BroadcastProcessFunction<Map<String, Integer>, Tuple2<String, String>, Map<String, Integer>> {
#Override
public void processBroadcastElement(Tuple2<String, String> in, Context ctx, Collector<Map<String, Integer>> out) throws Exception {
ctx.getBroadcastState(REMAPPING_STATE).put(in.f0, in.f1);
}
#Override
public void processElement(Map<String, Integer> in, ReadOnlyContext ctx, Collector<Map<String, Integer>> out) throws Exception {
final ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(REMAPPING_STATE);
Map<String, Integer> result = new HashMap<>();
for (String key : in.keySet()) {
if (state.contains(key)) {
result.put(state.get(key), in.get(key));
} else {
result.put(key, in.get(key));
}
}
out.collect(result);
}
}

Join a JavaRDD with a JavaPairRDD

I've got two files, one with anagrafic data (ID, name, last name) and another one with bank operations (ID, amount, idPerson).
I extracted two JavaRDDs: one regarding the people, another one regarding the total amount of each persons' operations (through a reduceByKey).
How can I create a new JavaPairRDD<Integer, Subject> where the Integer is the amount and the Subject is the person?
I tried this but didn't work:
JavaRDD<String> pLines = jsc.textFile("operations.csv").filter(x->!x.contains("ID"));
JavaRDD<String> pLines2 = jsc.textFile("anagraphic.txt").filter(x->!x.contains("\"ID\""));
JavaRDD<Soggetto> pSoggetti = pLines2.map(new EstraiSoggetti());
JavaPairRDD<Integer, Integer> pIDSubjectAmount = pTransazioni.mapToPair((x)->new Tuple2<Integer,Integer>(x.subject, x.amount));
JavaPairRDD<Integer, Transazione> pTransazioni2 = pLines.mapToPair(new EstraiTransazioniPair());
JavaPairRDD<Integer, Integer> pFrequencies2 = pIDSubjectAmount.reduceByKey(new Sum());
JavaPairRDD<Integer, Tuple2<Transazione, Soggetto>> pSoggettiTransazioni = pTransazioni2.join(pSoggetti2);
List<Tuple2<Integer, Soggetto>> list = pSoggetti2.collect();
My functions used for extraction
public class EstraiSoggetti implements Function<String, Soggetto> {
public Soggetto call(String line) throws Exception {
String [] fields = line.split(";");
return new Soggetto(Integer.parseInt(fields[0]), fields[1], fields[2]);
}
}
public class EstraiTransazioniPair implements PairFunction<String, Integer, Transazione> {
public Tuple2<Integer, Transazione> call(String line) throws Exception {
String [] fields = line.split(";");
return new Tuple2<Integer, Transazione>(Integer.parseInt(fields[2]), new Transazione(Integer.parseInt(fields[0]), Integer.parseInt(fields[1]), Integer.parseInt(fields[2]), Integer.parseInt(fields[3]), fields[4]));
}
}

Anonymous class do not have an argument

I am learning Apache Spark. Given such an implementation of spark using java below, I am confused about some details about it.
public class JavaWordCount {
public static void main(String[] args) throws Exception {
if (args.length < 2) {
System.err.println("Usage: JavaWordCount <master> <file>");
System.exit(1);
}
JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
System.getenv("SPARK_HOME"), System.getenv("SPARK_EXAMPLES_JAR"));
JavaRDD<String> lines = ctx.textFile(args[1], 1);
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
});
JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2 tuple : output) {
System.out.println(tuple._1 + ": " + tuple._2);
}
System.exit(0);
}
}
According to my comprehension, begin in line 12, it passed an anonymous class FlatMapFunction into the lines.flatMap() as an argument. Then what does the String s mean? It seems that it doesn't pass an created String s as an argument, then how will the FlatMapFunction<String,String>(){} class works since no specific arguments are passed into?

The anonymous class instance you're passing is overriding the call(String s) method. Whatever is receiving this anonymous class instance is something that wants to make use of that call() method during its execution: it will be (somehow) constructing strings and passing them (directly or indirectly) to the call() method of whatever you've passed in.
So the fact that you're not invoking the method you've defined isn't a worry: something else is doing so.
This is a common use case for anonymous inner classes. A method m() expects to be passed something that implements the Blah interface, and the Blah interface has a frobnicate(String s) method in it. So we call it with
m(new Blah() {
public void frobnicate(String s) {
//exciting code goes here to do something with s
}
});
and the m method will now be able to take this instance that implements Blah, and invoke frobnicate() on it.
Perhaps m looks like this:
public void m(Blah b) {
b.frobnicate("whatever");
}
Now the frobnicate() method that we wrote in our inner class is being invoked, and as it runs, the parameter s will be set to "whatever".

All your are doing here is passing a FlatMapFunction as argument to the flatMap method; your passed FlatMapFunction overrides call(String s):
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>()
{
public Iterable<String> call(String s)
{
return Arrays.asList(s.split(" "));
}
});
The code implementing lines.flatMap could look like this for instance:
public JavaRDD<String> flatMap(FlatMapFunction<String, String> map)
{
String str = "some string";
Iterable<String> it = map.call(str);
// do stuff with 'it'
// return a JavaRDD<String>
}

How to Convert Stream Stream<HashMap<String, Object>> to HashMap Array HashMap<String, Object>[]?

I am newbie in Java 8 Streams. Please advice, how to Convert Stream Stream<HashMap<String, Object>> to HashMap Array HashMap<String, Object>[] ?
For example, I has some stream in code:
Stream<String> previewImagesURLsList = fileNames.stream();
Stream<HashMap<String, Object>> imagesStream = previewImagesURLsList
.map(new Function<String, HashMap<String, Object>>() {
#Override
public HashMap<String, Object> apply(String person) {
HashMap<String, Object> m = new HashMap<>();
m.put("dfsd", person);
return m;
}
});
How I can do something like
HashMap<String, Object>[] arr = imagesStream.toArray();
?
Sorry my bad English.

The following should work. Unfortunately, you have to suppress the unchecked warning.
#SuppressWarnings("unchecked")
HashMap<String, Object>[] arr = imagesStream.toArray(HashMap[]::new);
The expression HashMap[]::new is an array constructor reference, which is a kind of method reference. Method references provide an alternative way to implement functional interfaces. You can also use a lambda expression:
#SuppressWarnings({"unchecked", "rawtypes"})
HashMap<String, Object>[] array = stream.toArray(n -> new HashMap[n]);
Before Java 8, you would have used an anonymous inner class for that purpose.
#SuppressWarnings({"unchecked", "rawtypes"})
HashMap<String, Object>[] array = stream.toArray(new IntFunction<HashMap[]>() {
public HashMap[] apply(int n) {
return new HashMap[n];
}
});

My method is too specific. How can I make it more generic?

I have a class, the outline of which is basically listed below.
import org.apache.commons.math.stat.Frequency;
public class WebUsageLog {
private Collection<LogLine> logLines;
private Collection<Date> dates;
WebUsageLog() {
this.logLines = new ArrayList<LogLine>();
this.dates = new ArrayList<Date>();
}
SortedMap<Double, String> getFrequencyOfVisitedSites() {
SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
Collection<String> domains = new HashSet<String>();
Frequency freq = new Frequency();
for (LogLine line : this.logLines) {
freq.addValue(line.getVisitedDomain());
domains.add(line.getVisitedDomain());
}
for (String domain : domains) {
frequencyMap.put(freq.getPct(domain), domain);
}
return frequencyMap;
}
}
The intention of this application is to allow our Human Resources folks to be able to view Web Usage Logs we send to them. However, I'm sure that over time, I'd like to be able to offer the option to view not only the frequency of visited sites, but also other members of LogLine (things like the frequency of assigned categories, accessed types [text/html, img/jpeg, etc...] filter verdicts, and so on). Ideally, I'd like to avoid writing individual methods for compilation of data for each of those types, and they could each end up looking nearly identical to the getFrequencyOfVisitedSites() method.
So, my question is twofold: first, can you see anywhere where this method should be improved, from a mechanical standpoint? And secondly, how would you make this method more generic, so that it might be able to handle an arbitrary set of data?

This is basically the same thing as Eugene's solution, I just left all the frequency calculation stuff in the original method and use the strategy only for getting the field to work on.
If you don't like enums you could certainly do this with an interface instead.
public class WebUsageLog {
private Collection<LogLine> logLines;
private Collection<Date> dates;
WebUsageLog() {
this.logLines = new ArrayList<LogLine>();
this.dates = new ArrayList<Date>();
}
SortedMap<Double, String> getFrequency(LineProperty property) {
SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
Collection<String> values = new HashSet<String>();
Frequency freq = new Frequency();
for (LogLine line : this.logLines) {
freq.addValue(property.getValue(line));
values.add(property.getValue(line));
}
for (String value : values) {
frequencyMap.put(freq.getPct(value), value);
}
return frequencyMap;
}
public enum LineProperty {
VISITED_DOMAIN {
#Override
public String getValue(LogLine line) {
return line.getVisitedDomain();
}
},
CATEGORY {
#Override
public String getValue(LogLine line) {
return line.getCategory();
}
},
VERDICT {
#Override
public String getValue(LogLine line) {
return line.getVerdict();
}
};
public abstract String getValue(LogLine line);
}
}
Then given an instance of WebUsageLog you could call it like this:
WebUsageLog usageLog = ...
SortedMap<Double, String> visitedSiteFrequency = usageLog.getFrequency(VISITED_DOMAIN);
SortedMap<Double, String> categoryFrequency = usageLog.getFrequency(CATEGORY);

I'd introduce an abstraction like "data processor" for each computation type, so you can just call individual processors for each line:
...
void process(Collection<Processor> processors) {
for (LogLine line : this.logLines) {
for (Processor processor : processors) {
processor.process();
}
}
for (Processor processor : processors) {
processor.complete();
}
}
...
public interface Processor {
public void process(LogLine line);
public void complete();
}
public class FrequencyProcessor implements Processor {
SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
Collection<String> domains = new HashSet<String>();
Frequency freq = new Frequency();
public void process(LogLine line)
String property = getProperty(line);
freq.addValue(property);
domains.add(property);
}
protected String getProperty(LogLine line) {
return line.getVisitedDomain();
}
public void complete()
for (String domain : domains) {
frequencyMap.put(freq.getPct(domain), domain);
}
}
}
You could also change a LogLine API to be more like a Map, i.e. instead of strongly typed line.getVisitedDomain() could use line.get("VisitedDomain"), then you can write a generic FrequencyProcessor for all properties and just pass a property name in its constructor.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Mapping RDD with several comma separated fields in Spark - java

Related

Apache Flink Process xml and write them to database

Join a JavaRDD with a JavaPairRDD

Anonymous class do not have an argument

How to Convert Stream Stream<HashMap<String, Object>> to HashMap Array HashMap<String, Object>[]?

My method is too specific. How can I make it more generic?

Categories

Resources