I've got two files: one with personal ("anagraphic") data (ID, name, last name) and another with bank operations (ID, amount, idPerson).
I extracted two JavaRDDs: one for the people, another for the total amount of each person's operations (computed with a reduceByKey).
How can I create a new JavaPairRDD<Integer, Subject> where the Integer is the amount and the Subject is the person?
I tried this, but it didn't work:
JavaRDD<String> pLines = jsc.textFile("operations.csv").filter(x -> !x.contains("ID"));
JavaRDD<String> pLines2 = jsc.textFile("anagraphic.txt").filter(x -> !x.contains("\"ID\""));
JavaRDD<Soggetto> pSoggetti = pLines2.map(new EstraiSoggetti());
JavaPairRDD<Integer, Transazione> pTransazioni2 = pLines.mapToPair(new EstraiTransazioniPair());
// (subjectId, amount) pairs, reduced to the total amount per person
JavaPairRDD<Integer, Integer> pIDSubjectAmount = pTransazioni2.mapToPair(x -> new Tuple2<>(x._2().subject, x._2().amount));
JavaPairRDD<Integer, Integer> pFrequencies2 = pIDSubjectAmount.reduceByKey(new Sum());
// pSoggetti2 is meant to be the subjects keyed by their ID (a JavaPairRDD<Integer, Soggetto>)
JavaPairRDD<Integer, Tuple2<Transazione, Soggetto>> pSoggettiTransazioni = pTransazioni2.join(pSoggetti2);
List<Tuple2<Integer, Soggetto>> list = pSoggetti2.collect();
My functions used for extraction:
public class EstraiSoggetti implements Function<String, Soggetto> {
public Soggetto call(String line) throws Exception {
String [] fields = line.split(";");
return new Soggetto(Integer.parseInt(fields[0]), fields[1], fields[2]);
}
}
public class EstraiTransazioniPair implements PairFunction<String, Integer, Transazione> {
public Tuple2<Integer, Transazione> call(String line) throws Exception {
String [] fields = line.split(";");
return new Tuple2<Integer, Transazione>(Integer.parseInt(fields[2]), new Transazione(Integer.parseInt(fields[0]), Integer.parseInt(fields[1]), Integer.parseInt(fields[2]), Integer.parseInt(fields[3]), fields[4]));
}
}
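A minimal sketch of the missing step, assuming Soggetto exposes a getId() getter (not shown above): key the subjects by their ID, join with the per-person totals, then re-key by the amount.

// subjects keyed by their ID, so they can be joined with pFrequencies2
JavaPairRDD<Integer, Soggetto> pSoggettiById = pSoggetti.mapToPair(s -> new Tuple2<>(s.getId(), s));

// the join yields (id, (totalAmount, subject)); drop the id and re-key by the amount
JavaPairRDD<Integer, Soggetto> pAmountSubject = pFrequencies2.join(pSoggettiById)
        .mapToPair(t -> new Tuple2<>(t._2()._1(), t._2()._2()));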
Related
I have the following use case.
XML files are written to a Kafka topic, which I want to consume and process via Flink.
The XML attributes have to be renamed to match the database table columns. These renames have to be flexible and maintainable from outside the Flink job.
At the end, the attributes have to be written to the database.
Each XML document represents a database record.
As a second step, some attributes of all XML documents from the last x minutes have to be aggregated.
As far as I know, Flink is capable of all the mentioned steps, but I am lacking an idea of how to implement it correctly.
Currently I have implemented the Kafka source, retrieve the XML document, and parse it via a custom MapFunction. There I create a POJO and store each attribute name and value in a HashMap.
public class Data {
    private Map<String, String> attributes = new HashMap<>();
}
The HashMap then contains entries like:
Key: path.to.attribute.one, Value: value of attribute one
Now I would like to use broadcast state to change the original attribute names to the database column names.
At this stage I am stuck: I have my POJO data with the attributes inside the HashMap, but I don't know how to connect it with the mapping via the broadcast.
Another way would be to flatMap the XML document attributes into single records. This leaves me with two problems:
How to ensure that attributes from one document don't get mixed with those from another document within the stream
How to merge all the attributes of one document back together to insert them as one record into the database
For the second stage I am aware of the window function, even if I haven't understood it in every detail, but I guess it would fit my requirement. The question for this stage is whether I can use more than one sink in one job, where one would be a stream of the raw data and one a stream of the aggregated data.
Can someone help with a hint?
Cheers
UPDATE
Here is what I've got so far. I simplified the code; the XmlData POJO represents my parsed XML document.
public class StreamingJob {
static Logger LOG = LoggerFactory.getLogger(StreamingJob.class);
public static void main(String[] args) throws Exception {
// set up the streaming execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
XmlData xmlData1 = new XmlData();
xmlData1.addAttribute("path.to.attribute.eventName","Start");
xmlData1.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:00.000");
xmlData1.addAttribute("third.path.to.attribute.eventSource","Source1");
xmlData1.addAttribute("path.to.attribute.additionalAttribute","Lorem");
XmlData xmlData2 = new XmlData();
xmlData2.addAttribute("path.to.attribute.eventName","Start");
xmlData2.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:01.000");
xmlData2.addAttribute("third.path.to.attribute.eventSource","Source2");
xmlData2.addAttribute("path.to.attribute.additionalAttribute","First");
XmlData xmlData3 = new XmlData();
xmlData3.addAttribute("path.to.attribute.eventName","Start");
xmlData3.addAttribute("second.path.to.attribute.eventTimestamp","2020-11-18T18:00:01.000");
xmlData3.addAttribute("third.path.to.attribute.eventSource","Source1");
xmlData3.addAttribute("path.to.attribute.additionalAttribute","Day");
Mapping mapping1 = new Mapping();
mapping1.addMapping("path.to.attribute.eventName","EVENT_NAME");
mapping1.addMapping("second.path.to.attribute.eventTimestamp","EVENT_TIMESTAMP");
DataStream<Mapping> mappingDataStream = env.fromElements(mapping1);
MapStateDescriptor<String, Mapping> mappingStateDescriptor = new MapStateDescriptor<String, Mapping>(
"MappingBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<Mapping>() {}));
BroadcastStream<Mapping> mappingBroadcastStream = mappingDataStream.broadcast(mappingStateDescriptor);
DataStream<XmlData> dataDataStream = env.fromElements(xmlData1, xmlData2, xmlData3);
//Convert the xml with all attributes to a stream of attribute names and values
DataStream<Tuple2<String, String>> recordDataStream = dataDataStream
.flatMap(new CustomFlatMapFunction());
//Map the attributes with the mapping information
DataStream<Tuple2<String,String>> outputDataStream = recordDataStream
.connect(mappingBroadcastStream)
.process(); // incomplete: this is where I'm stuck and don't know what to pass
env.execute("Process xml data and write it to database");
}
static class XmlData{
private Map<String,String> attributes = new HashMap<>();
public XmlData(){
}
public String toString(){
return this.attributes.toString();
}
public Map<String,String> getColumns(){
return this.attributes;
}
public void addAttribute(String key, String value){
this.attributes.put(key,value);
}
public String getAttributeValue(String attributeName){
return attributes.get(attributeName);
}
}
static class Mapping{
//First string is the attribute path and name
//Second string is the database column name
Map<String,String> mappingTuple = new HashMap<>();
public Mapping(){}
public void addMapping(String attributeNameWithPath, String databaseColumnName){
this.mappingTuple.put(attributeNameWithPath,databaseColumnName);
}
public Map<String, String> getMappingTuple() {
return mappingTuple;
}
public void setMappingTuple(Map<String, String> mappingTuple) {
this.mappingTuple = mappingTuple;
}
}
static class CustomFlatMapFunction implements FlatMapFunction<XmlData, Tuple2<String,String>> {
@Override
public void flatMap(XmlData xmlData, Collector<Tuple2< String,String>> collector) throws Exception {
for(Map.Entry<String,String> entrySet : xmlData.getColumns().entrySet()){
collector.collect(new Tuple2<>(entrySet.getKey(), entrySet.getValue()));
}
}
}
static class CustomBroadcastingFunction extends BroadcastProcessFunction {
@Override
public void processElement(Object o, ReadOnlyContext readOnlyContext, Collector collector) throws Exception {
}
@Override
public void processBroadcastElement(Object o, Context context, Collector collector) throws Exception {
}
}
}
Here's some example code of how to do this using a BroadcastStream. There's a subtle issue where the attribute remapping data might show up after one of the records. Normally you'd use a timer with state to hold onto any records that are missing remapping data, but in your case it's unclear whether a missing remapping is a "need to wait longer" or "no mapping exists". In any case, this should get you started...
private static MapStateDescriptor<String, String> REMAPPING_STATE = new MapStateDescriptor<>("remappings", String.class, String.class);
@Test
public void testUnkeyedStreamWithBroadcastStream() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(2);
List<Tuple2<String, String>> attributeRemapping = new ArrayList<>();
attributeRemapping.add(new Tuple2<>("one", "1"));
attributeRemapping.add(new Tuple2<>("two", "2"));
attributeRemapping.add(new Tuple2<>("three", "3"));
attributeRemapping.add(new Tuple2<>("four", "4"));
attributeRemapping.add(new Tuple2<>("five", "5"));
attributeRemapping.add(new Tuple2<>("six", "6"));
BroadcastStream<Tuple2<String, String>> attributes = env.fromCollection(attributeRemapping)
.broadcast(REMAPPING_STATE);
List<Map<String, Integer>> xmlData = new ArrayList<>();
xmlData.add(makePOJO("one", 10));
xmlData.add(makePOJO("two", 20));
xmlData.add(makePOJO("three", 30));
xmlData.add(makePOJO("four", 40));
xmlData.add(makePOJO("five", 50));
DataStream<Map<String, Integer>> records = env.fromCollection(xmlData);
records.connect(attributes)
.process(new MyRemappingFunction())
.print();
env.execute();
}
private Map<String, Integer> makePOJO(String key, int value) {
Map<String, Integer> result = new HashMap<>();
result.put(key, value);
return result;
}
@SuppressWarnings("serial")
private static class MyRemappingFunction extends BroadcastProcessFunction<Map<String, Integer>, Tuple2<String, String>, Map<String, Integer>> {
@Override
public void processBroadcastElement(Tuple2<String, String> in, Context ctx, Collector<Map<String, Integer>> out) throws Exception {
ctx.getBroadcastState(REMAPPING_STATE).put(in.f0, in.f1);
}
@Override
public void processElement(Map<String, Integer> in, ReadOnlyContext ctx, Collector<Map<String, Integer>> out) throws Exception {
final ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(REMAPPING_STATE);
Map<String, Integer> result = new HashMap<>();
for (String key : in.keySet()) {
if (state.contains(key)) {
result.put(state.get(key), in.get(key));
} else {
result.put(key, in.get(key));
}
}
out.collect(result);
}
}
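As for the two flatMap concerns in the question (attributes of different documents getting mixed, and merging them back into one record), here is a sketch assuming each document carries, or is assigned, a unique ID so that every attribute record can be tagged with it:

// each element is (docId, map holding a single attribute); docId is an assumed field
DataStream<Tuple2<String, Map<String, String>>> attributeRecords = ...;

DataStream<Tuple2<String, Map<String, String>>> reassembled = attributeRecords
        .keyBy(t -> t.f0) // attributes of the same document always meet on the same key
        .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
        .reduce((a, b) -> {
            a.f1.putAll(b.f1); // merge the single-attribute maps back into one record
            return a;
        });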
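Regarding the question about multiple sinks in one job: that works; you simply attach a sink to every stream you want to persist. A sketch with placeholder sink instances:

// the raw records and the windowed aggregates each get their own sink
reassembled.addSink(rawDatabaseSink);
reassembled
        .keyBy(t -> t.f0)
        .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
        .reduce(myAggregation)
        .addSink(aggregatedDatabaseSink);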
I have the below class and would like to transform a list of data objects into pivot-table format in Java.
public class Data {
private String consultedOn;
private String consultedBy;
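// Constructor taking (consultedOn, consultedBy), as used below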
// Getters
// Setters
}
List<Data> reports = new ArrayList<Data>();
reports.add(new Data("04/12/2018","Mr.Bob"));
reports.add(new Data("04/12/2018","Mr.Jhon"));
reports.add(new Data("04/12/2018","Mr.Bob"));
reports.add(new Data("05/12/2018","Mr.Jhon"));
reports.add(new Data("06/12/2018","Mr.Bob"));
reports.add(new Data("06/12/2018","Mr.Jhon"));
reports.add(new Data("07/12/2018","Mr.Bob"));
I would like to transform the above list into the below table format in Java, within a collection.
consultedOn Mr.Bob Mr.Jhon
---------------------------------------
04/12/2018 2 1
05/12/2018 0 1
06/12/2018 1 1
07/12/2018 1 0
Note that the consultedOn field is not restricted to the values shown; it may contain any data, so the collection should be dynamic.
I tried using Java 8 streams with the below code.
class DataMap {
private String consultedOn;
private String consultedBy;
public DataMap(String consultedOn) {
super();
this.consultedOn = consultedOn;
}
public DataMap(String consultedOn, String consultedBy) {
super();
this.consultedOn = consultedOn;
this.consultedBy = consultedBy;
}
public String getConsultedOn() {
return consultedOn;
}
public void setConsultedOn(String consultedOn) {
this.consultedOn = consultedOn;
}
public String getConsultedBy() {
return consultedBy;
}
public void setConsultedBy(String consultedBy) {
this.consultedBy = consultedBy;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((consultedOn == null) ? 0 : consultedOn.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (!(obj instanceof DataMap))
return false;
DataMap other = (DataMap) obj;
if (consultedOn == null) {
if (other.consultedOn != null)
return false;
} else if (!consultedOn.equals(other.consultedOn))
return false;
return true;
}
}
Map<DataMap, List<DataReport>> map = reports.stream()
.collect(Collectors.groupingBy(x -> new DataMap(x.getConsultedOn(), x.getConsultedBy())));
But the map is not giving the intended results.
I'm not sure how to go ahead with this kind of data; any help will be appreciated.
Here's a complete answer, using the technique I explained in the comment: design a class Row representing what you want to generate for each row, i.e. a consultedOn string and the number of consultations for each person.
public class Pivot {
private static final class Data {
private final String consultedOn;
private final String consultedBy;
public Data(String consultedOn, String consultedBy) {
this.consultedOn = consultedOn;
this.consultedBy = consultedBy;
}
public String getConsultedOn() {
return consultedOn;
}
public String getConsultedBy() {
return consultedBy;
}
}
private static final class Row {
private final String consultedOn;
private final Map<String, Integer> consultationsByPerson = new HashMap<>();
public Row(String consultedOn) {
this.consultedOn = consultedOn;
}
public void addPerson(String person) {
consultationsByPerson.merge(person, 1, Integer::sum);
}
public int getConsultationsFor(String person) {
return consultationsByPerson.getOrDefault(person, 0);
}
public String getConsultedOn() {
return consultedOn;
}
}
private static class PivotReport {
private final Map<String, Row> rowsByConsultedOn = new HashMap<>();
private SortedSet<String> persons = new TreeSet<>();
private PivotReport() {}
private void addData(Data d) {
rowsByConsultedOn.computeIfAbsent(d.getConsultedOn(), Row::new).addPerson(d.getConsultedBy());
persons.add(d.consultedBy);
}
public static PivotReport create(List<Data> list) {
PivotReport report = new PivotReport();
list.forEach(report::addData);
return report;
}
public String toString() {
String headers = "Consulted on\t" + String.join("\t", persons);
String rows = rowsByConsultedOn.values()
.stream()
.sorted(Comparator.comparing(Row::getConsultedOn))
.map(this::rowToString)
.collect(Collectors.joining("\n"));
return headers + "\n" + rows;
}
private String rowToString(Row row) {
return row.getConsultedOn() + "\t" +
persons.stream()
.map(person -> Integer.toString(row.getConsultationsFor(person)))
.collect(Collectors.joining("\t"));
}
}
public static void main(String[] args) {
List<Data> list = createListOfData();
PivotReport report = PivotReport.create(list);
System.out.println(report);
}
private static List<Data> createListOfData() {
List<Data> reports = new ArrayList<Data>();
reports.add(new Data("04/12/2018","Mr.Bob"));
reports.add(new Data("04/12/2018","Mr.Jhon"));
reports.add(new Data("04/12/2018","Mr.Bob"));
reports.add(new Data("05/12/2018","Mr.Jhon"));
reports.add(new Data("06/12/2018","Mr.Bob"));
reports.add(new Data("06/12/2018","Mr.Jhon"));
reports.add(new Data("07/12/2018","Mr.Bob"));
reports.add(new Data("07/12/2018","Mr.Smith"));
return reports;
}
}
Note that since you're using String instead of LocalDate for the consultedOn field, the dates will be sorted lexicographically instead of being sorted chronologically. You should use the appropriate type: LocalDate.
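For example, parsing the existing strings into LocalDate is a one-liner; a sketch assuming the dd/MM/yyyy reading of the sample dates:

DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy");
LocalDate consultedOn = LocalDate.parse("04/12/2018", fmt); // 2018-12-04
// LocalDate implements Comparable, so rows would then sort chronologically for free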
You are probably looking to use Collectors.groupingBy to group the List<DataMap> by consultedOn, further grouping it by the consultedBy attribute together with its count:
Map<String, Map<String, Long>> finalMapping = reports.stream()
.collect(Collectors.groupingBy(DataMap::getConsultedOn,
Collectors.groupingBy(DataMap::getConsultedBy,Collectors.counting())));
This would give you the following output:
{05/12/2018={Mr.Jhon=1}, 06/12/2018={Mr.Jhon=1, Mr.Bob=1},
07/12/2018={Mr.Bob=1}, 04/12/2018={Mr.Jhon=1, Mr.Bob=2}}
Further, if you require all the corresponding consultedBy values to be accounted for, you can create a Set of them from the initial List<DataMap>:
Set<String> consultedBys = reports.stream()
.map(DataMap::getConsultedBy)
.collect(Collectors.toSet());
Using it, you can modify the map obtained above to contain 0 counts as well, in the following manner:
finalMapping.forEach((k, v) -> consultedBys.forEach(c -> v.putIfAbsent(c, 0L)));
This would now give the following output:
{05/12/2018={Mr.Jhon=1, Mr.Bob=0}, 06/12/2018={Mr.Jhon=1, Mr.Bob=1},
07/12/2018={Mr.Jhon=0, Mr.Bob=1}, 04/12/2018={Mr.Jhon=1, Mr.Bob=2}}
Another approach would be like this:
Map<Pair<String, String>, Integer> map = reports
.stream()
.collect(toMap(data -> new Pair<>(data.getConsultedOn(),
data.getConsultedBy()), data -> 1, Integer::sum));
Map<String, DataMap> result= new HashMap<>();
-
class DataMap {
private String consultedOn;
private Map<String, Integer> map;
}
-
Set<String> persons = new HashSet<>();
persons = reports.stream().map(Data::getConsultedBy).collect(Collectors.toSet());
-
for (Map.Entry<Pair<String, String>, Integer> entry : map.entrySet()) {
Map<String, Integer> val = new HashMap<>();
for (String person : persons) {
if (!person.equals(entry.getKey().getValue()))
val.put(person, 0);
else
val.put(entry.getKey().getValue(), entry.getValue());
}
result.put(entry.getKey().getKey(), new DataMap(entry.getKey().getKey(), val));
}
and final result:
List<DataMap> finalResult = new ArrayList<>(result.values());
Instead of using a separate data structure, you can use a Map whose key is consultedOn (a date or String) and whose value is a list of Strings (or of your own POJO with an overridden equals() method). Here I have used a map like Map<String, List<String>>.
All you need is the two methods:
one to build the report (addDataToReport): for each consultedOn (key), create a list of the doctors consulted. See the code comments for the map.merge usage
and one to display the data in report form (printReport). We use "%10s" to get uniform column widths; unlike println, format doesn't implicitly append a newline character
Moreover, to get the report's columns we need a set (a unique list of values); doctors.add(consultedBy); serves this purpose, and Java takes care of keeping the doctors' values unique.
public class Application {
Set<String> doctors = new LinkedHashSet<>();
private void addDataToReport(Map<String, List<String>> reportMap, String consultedOn, String consultedBy) {
doctors.add(consultedBy); // set the doctors Set
reportMap.merge(consultedOn, Arrays.asList(consultedBy)// if key = consultedOn is not there add , a new list
, (v1, v2) -> Stream.concat(v1.stream(), v2.stream()).collect(Collectors.toList()));//else merge previous and new values , here concatenate two lists
}
private void printReport(Map<String, List<String>> reportMap) {
/*Set Headers*/
String formatting = "%10s";//give a block of 10 characters for each string to print
System.out.format(formatting, "consultedOn");
doctors.forEach(t -> System.out.format(formatting, t));// print data on console without an implicit new line
System.out.println("\n---------------------------------------");
/*Set row values*/
for (Map.Entry<String, List<String>> entry : reportMap.entrySet()) {
Map<String, Integer> map = new LinkedHashMap<>();
doctors.forEach(t -> map.put(t, 0)); // initialise each doctor count on a day to 0
entry.getValue().forEach(t -> map.put(t, map.get(t) + 1));
System.out.format(formatting, entry.getKey());
map.values().forEach(t -> System.out.format(formatting, t));
System.out.println();
}
}
public static void main(String[] args) {
Application application = new Application();
Map<String, List<String>> reportMap = new LinkedHashMap<>();
String MR_JHON = "Mr.Jhon";
String MR_BOB = "Mr.Bob ";
application.addDataToReport(reportMap, "04/12/2018", MR_BOB);
application.addDataToReport(reportMap, "04/12/2018", MR_JHON);
application.addDataToReport(reportMap, "04/12/2018", MR_BOB);
application.addDataToReport(reportMap, "05/12/2018", MR_JHON);
application.addDataToReport(reportMap, "06/12/2018", MR_BOB);
application.addDataToReport(reportMap, "06/12/2018", MR_JHON);
application.addDataToReport(reportMap, "07/12/2018", MR_BOB);
application.printReport(reportMap);
}
}
Result
consultedOn Mr.Bob Mr.Jhon
---------------------------------------
04/12/2018 2 1
05/12/2018 0 1
06/12/2018 1 1
07/12/2018 1 0
I'm trying to push my result set data into a nested map. Honestly, I've been struggling with the logic of how to do it. Here's a sample of my result set data:
ID Main Sub
1 Root Carrots
2 Root Beets
3 Root Turnips
4 Leafy Spinach
5 Leafy Celery
6 Fruits Apples
7 Fruits Oranges
I created a HashMap<Integer, HashMap<String, List<String>>>, in which I thought the inner map could hold the Main column as the key and the corresponding Subs as the list of values. The outer map would hold the ID as the key and the corresponding inner map as the value. I'm struggling to achieve this.
Any help would be appreciated.
I would suggest using a different structure.
You have unique IDs and Subs, but your Main values can repeat.
Thus I would suggest using the following structure:
HashMap<String, List<POJO>>
where the POJO holds the ID and Sub, and the key of the map is the Main value.
Thus you can easily do:
if (map.get(main) == null) {
    List<POJO> pojoList = new ArrayList<>();
    pojoList.add(pojo);
    map.put(main, pojoList); // the new list still has to be put into the map
} else {
    List<POJO> pojoList = map.get(main);
    pojoList.add(pojo);
}
But it ultimately depends on whether you need to do lookups by ID or by Main.
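Since Java 8, the whole if/else can be collapsed with computeIfAbsent; a minimal sketch with the same hypothetical POJO:

// creates the list on first access for a given main, then always appends
map.computeIfAbsent(main, k -> new ArrayList<>()).add(pojo);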
Below is the answer to your question, but the question is probably wrong. Since the ID is unique (just a guess), you're probably looking for
Map<Integer, DataObject> map = new HashMap<>();
where DataObject is a POJO containing the variables main and sub. Adding data to such a structure is easy.
Answer to question (added to show you how Maps and Lists work):
private Map<Integer, Map<String, List<String>>> map = new HashMap<>();
public static void main(String[] args) {
new Tester().go();
}
private void go() {
add(1, "Root", "Carrots");
add(2, "Root", "Beets");
add(3, "Root", "Turnips");
add(4, "Leafy", "Spinach");
add(5, "Leafy", "Celery");
add(6, "Fruits", "Apples");
add(7, "Fruits", "Oranges");
}
private void add(int id, String main, String sub) {
if (!map.containsKey(id)) {
map.put(id, new HashMap<String, List<String>>());
}
ArrayList<String> list = new ArrayList<String>();
list.add(sub);
map.get(id).put(main, list);
}
There is no need to make nested hash maps, because each row in the example is unique (each List in the nested map would have only one value).
In any case, here is an example algorithm in Java 8 style for your particular need:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
public class Main {
public static void main(String[] args) {
List<ResultSet> rows = new ArrayList<>();
rows.add(new ResultSet().setId(1).setMain("Root").setSub("Carrots"));
rows.add(new ResultSet().setId(2).setMain("Root").setSub("Beets"));
rows.add(new ResultSet().setId(3).setMain("Root").setSub("Turnips"));
rows.add(new ResultSet().setId(4).setMain("Leafy").setSub("Spinach"));
rows.add(new ResultSet().setId(5).setMain("Leafy").setSub("Celery"));
rows.add(new ResultSet().setId(6).setMain("Fruits").setSub("Apples"));
rows.add(new ResultSet().setId(7).setMain("Fruits").setSub("Oranges"));
HashMap<Integer, HashMap<String, List<String>>> result = new HashMap<>();
rows.forEach(row -> {
HashMap<String, List<String>> subsByMain = result.getOrDefault(row.getId(), new HashMap<>());
List<String> subs = subsByMain.getOrDefault(row.getMain(), new ArrayList<>());
subs.add(row.getSub());
subsByMain.put(row.getMain(), subs);
result.put(row.getId(), subsByMain);
});
}
static class ResultSet {
private Integer id;
private String main;
private String sub;
Integer getId() {
return id;
}
ResultSet setId(Integer id) {
this.id = id;
return this;
}
String getMain() {
return main;
}
ResultSet setMain(String main) {
this.main = main;
return this;
}
String getSub() {
return sub;
}
ResultSet setSub(String sub) {
this.sub = sub;
return this;
}
}
}
I am new to Spark and I am going over a tutorial where a line with several fields is parsed with Scala. The Scala code is like this:
val pass = lines.map(_.split(",")).
map(pass=>(pass(15),pass(7).toInt)).
reduceByKey(_+_)
where pass is the data received from socketTextStream (it's Spark Streaming). I am new to Spark and want to use Java to get the same result. I have declared a JavaReceiverInputDStream using:
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
I came up with two possible solutions:
using flatMap:
JavaDStream<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
@Override public Iterable<String> call(String x) {
return Arrays.asList(x.split(","));
}
});
But it doesn't seem right, since the result breaks the CSV into words without any order.
Using map (compilation error): this looks like the appropriate solution, but I am not able to extract fields 15 and 7 using:
JavaDStream<List<String>> words = lines.map(
new Function<String, List<String>>() {
public List<String> call(String s) {
return Arrays.asList(s.split(","));
}
});
This idea fails when I try to map List<String> => Tuple2<String, Integer>.
The mapping code is:
JavaPairDStream<String, Integer> pairs = words.map(
new PairFunction<List<String>, String, Integer>() {
public Tuple2<String, Integer> call(List<String> s) throws Exception {
return new Tuple2(s.get(15), 6);
}
});
The error:
method map in class org.apache.spark.streaming.api.java.AbstractJavaDStreamLike<T,This,R> cannot be applied to given types;
[ERROR] required: org.apache.spark.api.java.function.Function<java.util.List<java.lang.String>,R>
[ERROR] found: <anonymous org.apache.spark.api.java.function.PairFunction<java.util.List<java.lang.String>,java.lang.String,java.lang.Integer>>
[ERROR] reason: no instance(s) of type variable(s) R exist so that argument type <anonymous org.apache.spark.api.java.function.PairFunction<java.util.List<java.lang.String>,java.lang.String,java.lang.Integer>> conforms to formal parameter type org.apache.spark.api.java.function.Function<java.util.List<java.lang.String>,R>
Any suggestions on this?
Use this code; it will extract the required fields from the String.
JavaDStream<String> lines = { ..... };
JavaPairDStream<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String t) throws Exception {
String[] words = t.split(",");
return new Tuple2<String, Integer>(words[15],Integer.parseInt(words[7]));
}
});
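To mirror the trailing reduceByKey(_+_) from the Scala snippet, you would then sum per key; a sketch (Java spells out the Function2 that Scala abbreviates as _+_):

JavaPairDStream<String, Integer> sums = pairs.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer a, Integer b) throws Exception {
            return a + b; // the Java equivalent of Scala's _+_
        }
    });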
I have a class, the outline of which is basically listed below.
import org.apache.commons.math.stat.Frequency;
public class WebUsageLog {
private Collection<LogLine> logLines;
private Collection<Date> dates;
WebUsageLog() {
this.logLines = new ArrayList<LogLine>();
this.dates = new ArrayList<Date>();
}
SortedMap<Double, String> getFrequencyOfVisitedSites() {
SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
Collection<String> domains = new HashSet<String>();
Frequency freq = new Frequency();
for (LogLine line : this.logLines) {
freq.addValue(line.getVisitedDomain());
domains.add(line.getVisitedDomain());
}
for (String domain : domains) {
frequencyMap.put(freq.getPct(domain), domain);
}
return frequencyMap;
}
}
The intention of this application is to allow our Human Resources folks to be able to view Web Usage Logs we send to them. However, I'm sure that over time, I'd like to be able to offer the option to view not only the frequency of visited sites, but also other members of LogLine (things like the frequency of assigned categories, accessed types [text/html, img/jpeg, etc...] filter verdicts, and so on). Ideally, I'd like to avoid writing individual methods for compilation of data for each of those types, and they could each end up looking nearly identical to the getFrequencyOfVisitedSites() method.
So, my question is twofold: first, can you see anywhere where this method should be improved, from a mechanical standpoint? And secondly, how would you make this method more generic, so that it might be able to handle an arbitrary set of data?
This is basically the same thing as Eugene's solution; I just left all the frequency calculation in the original method and use the strategy only for getting the field to work on.
If you don't like enums you could certainly do this with an interface instead (see the sketch after the usage example below).
public class WebUsageLog {
private Collection<LogLine> logLines;
private Collection<Date> dates;
WebUsageLog() {
this.logLines = new ArrayList<LogLine>();
this.dates = new ArrayList<Date>();
}
SortedMap<Double, String> getFrequency(LineProperty property) {
SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
Collection<String> values = new HashSet<String>();
Frequency freq = new Frequency();
for (LogLine line : this.logLines) {
freq.addValue(property.getValue(line));
values.add(property.getValue(line));
}
for (String value : values) {
frequencyMap.put(freq.getPct(value), value);
}
return frequencyMap;
}
public enum LineProperty {
VISITED_DOMAIN {
@Override
public String getValue(LogLine line) {
return line.getVisitedDomain();
}
},
CATEGORY {
@Override
public String getValue(LogLine line) {
return line.getCategory();
}
},
VERDICT {
@Override
public String getValue(LogLine line) {
return line.getVerdict();
}
};
public abstract String getValue(LogLine line);
}
}
Then given an instance of WebUsageLog you could call it like this:
WebUsageLog usageLog = ...
SortedMap<Double, String> visitedSiteFrequency = usageLog.getFrequency(LineProperty.VISITED_DOMAIN);
SortedMap<Double, String> categoryFrequency = usageLog.getFrequency(LineProperty.CATEGORY);
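For reference, the interface variant mentioned above could look like the sketch below; with Java 8, a method reference can then replace each enum constant.

public interface LineProperty {
    String getValue(LogLine line);
}

// usage, passing method references instead of enum constants:
SortedMap<Double, String> visitedSiteFrequency = usageLog.getFrequency(LogLine::getVisitedDomain);
SortedMap<Double, String> categoryFrequency = usageLog.getFrequency(LogLine::getCategory);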
I'd introduce an abstraction like "data processor" for each computation type, so you can just call individual processors for each line:
...
void process(Collection<Processor> processors) {
    for (LogLine line : this.logLines) {
        for (Processor processor : processors) {
            processor.process(line); // pass the current line to each processor
        }
    }
    for (Processor processor : processors) {
        processor.complete();
    }
}
...
public interface Processor {
    public void process(LogLine line);
    public void complete();
}
public class FrequencyProcessor implements Processor {
    SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
    Collection<String> domains = new HashSet<String>();
    Frequency freq = new Frequency();

    public void process(LogLine line) {
        String property = getProperty(line);
        freq.addValue(property);
        domains.add(property);
    }

    protected String getProperty(LogLine line) {
        return line.getVisitedDomain();
    }

    public void complete() {
        for (String domain : domains) {
            frequencyMap.put(freq.getPct(domain), domain);
        }
    }
}
You could also change the LogLine API to be more like a Map, i.e. instead of the strongly typed line.getVisitedDomain() you could use line.get("VisitedDomain"); then you can write a generic FrequencyProcessor for all properties and just pass a property name in its constructor.
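A sketch of that map-style variant, assuming LogLine gains a get(String) accessor (which is not part of the original class):

public class FrequencyProcessor implements Processor {
    private final String propertyName;
    private final SortedMap<Double, String> frequencyMap = new TreeMap<>(Collections.reverseOrder());
    private final Collection<String> values = new HashSet<>();
    private final Frequency freq = new Frequency();

    public FrequencyProcessor(String propertyName) {
        this.propertyName = propertyName;
    }

    public void process(LogLine line) {
        String value = line.get(propertyName); // hypothetical map-style accessor
        freq.addValue(value);
        values.add(value);
    }

    public void complete() {
        for (String value : values) {
            frequencyMap.put(freq.getPct(value), value);
        }
    }
}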