Apache Spark merge after updateStateByKey() - java

I'm trying to merge two streams, where one of them should be stateful (like static data with infrequent updates):
SparkConf conf = new SparkConf().setAppName("Test Application").setMaster("local[*]");
JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(10));
context.checkpoint(".");

JavaDStream<String> dataStream = context.socketTextStream("localhost", 9998);
JavaDStream<String> refDataStream = context.socketTextStream("localhost", 9999);

JavaPairDStream<String, String> pairDataStream = dataStream.mapToPair(e -> {
    String[] tmp = e.split(" ");
    return new Tuple2<>(tmp[0], tmp[1]);
});

JavaPairDStream<String, String> pairRefDataStream = refDataStream.mapToPair(e -> {
    String[] tmp = e.split(" ");
    return new Tuple2<>(tmp[0], tmp[1]);
}).updateStateByKey((Function2<List<String>, Optional<String>, Optional<String>>) (strings, stringOptional) -> {
    if (!strings.isEmpty()) {
        return Optional.of(strings.get(0));
    }
    return Optional.absent();
});

pairDataStream.join(pairRefDataStream).print();

context.start();
context.awaitTermination();
When I write 1 aaa into the first stream and 1 111 into the second immediately afterwards, everything works fine and I see the result of the join. But when I write 1 bbb into the first stream a minute later, I see nothing.
Do I understand correctly what updateStateByKey() does, or am I wrong?

updateStateByKey does exactly what you ask of it. In particular, if the current batch contains no new data for a key (strings.isEmpty()), you instruct it to forget the state (return Optional.absent();):
if (!strings.isEmpty()) {
    return Optional.of(strings.get(0));
}
return Optional.absent();
whereas what you probably want is to return the previous state:
if (!strings.isEmpty()) {
    return Optional.of(strings.get(0));
}
return stringOptional;
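Put together, a minimal sketch of the corrected pairRefDataStream (the question's own snippet, unchanged except for the last line of the update function):
JavaPairDStream<String, String> pairRefDataStream = refDataStream.mapToPair(e -> {
    String[] tmp = e.split(" ");
    return new Tuple2<>(tmp[0], tmp[1]);
}).updateStateByKey((Function2<List<String>, Optional<String>, Optional<String>>) (strings, stringOptional) -> {
    if (!strings.isEmpty()) {
        // New data arrived in this batch: take it as the new state.
        return Optional.of(strings.get(0));
    }
    // No new data for this key: keep the previous state.
    return stringOptional;
});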

Related

How can I convert it to a Java stream

I am pretty new to Java 8 streams. I was trying to work on a collection of objects using a stream, but was not able to achieve it in a precise way.
Below is the snippet I came up with (which gives the wrong result). The expected end result is a List<String> of "Names email@test.com" entries.
recordObjects is a collection of objects.
choices = recordObjects.stream()
        .filter(record -> record.getAttribute(OneRecord.AT_RECORD_SUBMITTER_TABLE_EMAIL) != null)
        .filter(record -> !record.getAttributeAsString(OneRecord.AT_RECORD_SUBMITTER_TABLE_EMAIL).isEmpty())
        .map(record -> record.getMultiValuedAttribute(OneRecord.AT_RECORD_SUBMITTER_TABLE_EMAIL, String.class))
        .flatMap(Collection::stream)
        .map(email -> getFormattedEmailAddress(ATTRI_AND_RECORD_CONTACT_DEFAULT_NAME, email))
        .collect(Collectors.toList());
But below is the exact logic I want to implement using streams:
for (CallerObject record : recordObjects) {
    List<String> emails = record.getMultiValuedAttribute(
            OneRecord.AT_RECORD_SUBMITTER_TABLE_EMAIL, String.class);
    List<String> names = record.getMultiValuedAttribute(
            OneRecord.AT_RECORD_SUBMITTER_TABLE_NAME, String.class);
    int N = emails.size();
    for (int i = 0; i < N; i++) {
        if (!isNullOrEmpty(emails.get(i))) {
            choices.add(getFormattedEmailAddress(isNullOrEmpty(names.get(i))
                    ? ATTRI_AND_RECORD_CONTACT_DEFAULT_NAME : names.get(i), emails.get(i)));
        }
    }
}
Since we don't know the getFormattedEmailAddress method, I used String.format instead to achieve the desired representation "Names email@test.com":
// the mapper function: using String.format
Function<RecordObject, String> toEmailString = r -> {
    // single-valued reads, mirroring the question's getAttributeAsString accessor
    String email = r.getAttributeAsString(OneRecord.AT_RECORD_SUBMITTER_TABLE_EMAIL);
    String name = r.getAttributeAsString(OneRecord.AT_RECORD_SUBMITTER_TABLE_NAME);
    if (email != null) {
        return String.format("%s %s", name, email);
    } else {
        return null;
    }
};
choices = recordObjects.stream()
        .map(toEmailString)       // map to the email format, or null
        .filter(Objects::nonNull) // exclude null strings where no email was found
        .collect(Collectors.toList());
Here is your older-version code changed to Java 8 streams:
final Function<RecordedObject, List<String>> filteredEmail = ro -> {
    final List<String> emails = ro.getMultiValuedAttribute(
            OneRecord.AT_RECORD_SUBMITTER_TABLE_EMAIL, String.class);
    final List<String> names = ro.getMultiValuedAttribute(
            OneRecord.AT_RECORD_SUBMITTER_TABLE_NAME, String.class);
    return IntStream.range(0, emails.size())
            .filter(index -> !isNullOrEmpty(emails.get(index)))
            // mapToObj (not map) is needed here, because the result is a String
            .mapToObj(index -> getFormattedEmailAddress(isNullOrEmpty(names.get(index))
                    ? ATTRI_AND_RECORD_CONTACT_DEFAULT_NAME : names.get(index), emails.get(index)))
            .collect(Collectors.toList());
};
// assign the result; the original version dropped it
choices = recordObjects
        .stream()
        .map(filteredEmail)
        .flatMap(Collection::stream)
        .collect(Collectors.toList());
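For reference, here is a self-contained sketch of the index-zipping idea, with hypothetical stand-ins for the question's helpers (isNullOrEmpty, the default name, and the formatting are all made up here):
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ZipListsDemo {
    // Hypothetical stand-in for the question's isNullOrEmpty helper.
    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    public static void main(String[] args) {
        List<String> emails = List.of("a@test.com", "", "b@test.com");
        List<String> names  = List.of("Alice", "Bob", "");

        // Walk both lists by index, skip blank emails, fall back to a default name.
        List<String> choices = IntStream.range(0, emails.size())
                .filter(i -> !isNullOrEmpty(emails.get(i)))
                .mapToObj(i -> String.format("%s %s",
                        isNullOrEmpty(names.get(i)) ? "DefaultName" : names.get(i),
                        emails.get(i)))
                .collect(Collectors.toList());

        System.out.println(choices); // [Alice a@test.com, DefaultName b@test.com]
    }
}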

How to read CSV in Hazelcast Jet in List<Map> or JsonArray format?

As I am new to Hazelcast Jet, I am trying a few things but am not getting the result I expected; please help me out.
Here is the code I am trying, so far without success:
BatchSource<List> companyListBatchSource = FileSources.files("directory")
        .glob("name.csv")
        .format(FileFormat.csv(List.class))
        .build();

pipeline.readFrom(companyListBatchSource)
        .writeTo(Sinks.list("mapName"));
Let me know how we can read it in List<Map<String, Object>> or JsonArray format?
You can pass a list of field names if you don't want to convert the values to a dedicated record; in that case you'll get a String[] for each record.
List<String> fieldNames = new ArrayList<>();
fieldNames.add("foo");
fieldNames.add("bar");

BatchSource<String[]> source = FileSources.files("directory")
        .glob("file.csv")
        .format(FileFormat.csv(fieldNames))
        .build();
And if you don't know the fields beforehand, you can pass null as the list of field names, as sketched below.
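A minimal sketch, assuming the FileFormat.csv(List<String>) overload (the cast only disambiguates it from the FileFormat.csv(Class) overload):
// Field names are then taken from the CSV header row.
BatchSource<String[]> source = FileSources.files("directory")
        .glob("file.csv")
        .format(FileFormat.csv((List<String>) null))
        .build();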
You can also create a custom file source, like below:
BatchSource<Map<String, String>> source = Sources.filesBuilder("directory")
        .glob("file.csv")
        .build(path -> {
            Stream<String> lines = Files.lines(path);
            // single-element array so the lambdas below can mutate the captured header row
            String[][] headers = new String[1][];
            return lines.filter(line -> {
                // the first line is the header row: remember it and drop it from the stream
                if (headers[0] == null) {
                    headers[0] = line.split(",");
                    return false;
                }
                return true;
            }).map(line -> {
                // zip each data row with the remembered header names
                String[] values = line.split(",");
                Map<String, String> map = new HashMap<>();
                for (int i = 0; i < headers[0].length; i++) {
                    String header = headers[0][i];
                    String value = values[i];
                    map.put(header, value);
                }
                return map;
            });
        });
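Wiring this custom source into the question's pipeline then stays the same:
pipeline.readFrom(source)
        .writeTo(Sinks.list("mapName"));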

Web scraping using multithreading

I wrote some code to look up movie names on IMDB, but if, for instance, I search for "Harry Potter", I will find more than one movie. I would like to use multithreading, but I don't have much knowledge in this area.
I am using the strategy design pattern to search across several websites, and inside one of the methods I have this code:
for (Element element : elements) {
    String searchedUrl = element.select("a").attr("href");
    String movieName = element.select("h2").text();
    if (movieName.matches(patternMatcher)) {
        Result result = new Result();
        result.setName(movieName);
        result.setLink(searchedUrl);
        result.setTitleProp(super.imdbConnection(movieName));
        System.out.println(movieName + " " + searchedUrl);
        resultList.add(result);
    }
}
which, for each element (a movie name), creates a new connection to IMDB to look up ratings and other data, on the super.imdbConnection(movieName) line.
The problem is, I would like to open all the connections at the same time, because with 5-6 movies found the process takes much longer than expected.
I am not asking for code, just some ideas. I thought about creating an inner class that implements Runnable and using it, but I don't see much point in that.
How can I rewrite that loop to use multithreading?
I am using Jsoup for parsing; Element and Elements are from that library.
The simplest way is parallelStream():
List<Result> resultList = elements.parallelStream()
        .map(element -> {
            String searchedUrl = element.select("a").attr("href");
            String movieName = element.select("h2").text();
            if (movieName.matches(patternMatcher)) {
                Result result = new Result();
                result.setName(movieName);
                result.setLink(searchedUrl);
                result.setTitleProp(super.imdbConnection(movieName));
                System.out.println(movieName + " " + searchedUrl);
                return result;
            } else {
                return null;
            }
        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
Keep in mind that parallelStream() runs on the common ForkJoinPool, which is sized for CPU-bound work rather than I/O-bound scraping. If you don't like parallelStream() and want to use threads, you can do this:
List<Element> elements = new ArrayList<>();

// create a function which returns an implementation of Callable
// input:  Element
// output: Callable<Result>
Function<Element, Callable<Result>> scrapFunction = (element) -> new Callable<Result>() {
    @Override
    public Result call() throws Exception {
        String searchedUrl = element.select("a").attr("href");
        String movieName = element.select("h2").text();
        if (movieName.matches(patternMatcher)) {
            Result result = new Result();
            result.setName(movieName);
            result.setLink(searchedUrl);
            // note: inside an anonymous class, plain "super" no longer refers to the
            // enclosing class's parent; qualify it as EnclosingClass.super.imdbConnection(...)
            result.setTitleProp(super.imdbConnection(movieName));
            System.out.println(movieName + " " + searchedUrl);
            return result;
        } else {
            return null;
        }
    }
};

// create a fixed pool of threads (one per element here; cap this for large lists)
ExecutorService executor = Executors.newFixedThreadPool(elements.size());

// submit a Callable<Result> for every Element by using scrapFunction.apply(...)
List<Future<Result>> futures = elements.stream()
        .map(e -> executor.submit(scrapFunction.apply(e)))
        .collect(Collectors.toList());
// collect all results from Callable<Result>
List<Result> resultList = futures.stream()
        .map(e -> {
            try {
                return e.get();
            } catch (Exception ignored) {
                return null;
            }
        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
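One small addition not in the original answer: once the futures have been drained, shut the pool down so its threads don't keep the JVM alive.
// allow the worker threads to exit once all tasks have completed
executor.shutdown();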

Find duplicates in first column and take average based on third column

My issue here is that I need to compute the average time for each id.
Sample data
T1,2020-01-16,11:16pm,start
T2,2020-01-16,11:18pm,start
T1,2020-01-16,11:20pm,end
T2,2020-01-16,11:23pm,end
I have written code that keeps the first column and third column in a map, something like
T1, 11:16pm
but I was not able to compute the values after putting them in the map. I also tried to keep them in a string array and split line by line, but I ran into the same issue with that approach too.
public class AverageTimeGenerate {
    public static void main(String[] args) throws IOException {
        File file = new File("/abc.txt");
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {
                    break;
                }
                ArrayList<String> list = new ArrayList<>();
                String[] tokens = line.split(",");
                for (String s : tokens) {
                    list.add(s);
                }
                Map<String, String> map = new HashMap<>();
                String[] data = line.split(",");
                String ids = data[0];
                String dates = data[1];
                String transactionTime = data[2];
                String transactionStartAndEndTime = data[3];
                String[] transactionIds = ids.split("/n");
                String[] timeOfEachTransaction = transactionTime.split("/n");
                for (String id : transactionIds) {
                    for (String time : timeOfEachTransaction) {
                        map.put(id, time);
                    }
                }
            }
        }
    }
}
Can anyone tell me whether it is possible to find duplicates in a map and compute the values in the map, or is there another way I can do this, so that the output looks like
T1 2:00
T2 5:00
I don't know what your logic is for computing the average time, but you can save the data for one particular transaction in a map. The map structure can look like this: the transaction id is the key, and all the times for it go into a list.
Map<String,List<String>> map = new HashMap<String,List<String>>();
You can do it like this:
Map<String, String> result = Files.lines(Paths.get("abc.txt"))
        .map(line -> line.split(","))
        .map(arr -> {
            try {
                return new AbstractMap.SimpleEntry<>(arr[0],
                        new SimpleDateFormat("HH:mm").parse(arr[2]));
            } catch (ParseException e) {
                return null;
            }
        })
        .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.collectingAndThen(
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList()),
                        list -> toStringTime.apply(convert.apply(list)))));
For simplicity, I've declared two functions:
// half the difference between the end and start timestamps, in milliseconds
Function<List<Date>, Long> convert = list -> (list.get(1).getTime() - list.get(0).getTime()) / 2;
// milliseconds -> "minutes:seconds"
Function<Long, String> toStringTime = l -> l / 60000 + ":" + l % 60000 / 1000;
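For reference, the same grouping can be written with java.time instead of SimpleDateFormat. This is only a sketch: the class name and the hard-coded sample lines are assumptions, and it copies the (end - start) / 2 "average" from the convert function above. A case-insensitive formatter is needed because the input uses lowercase pm.
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public class AverageTimeJavaTime {
    // Case-insensitive parser for values such as "11:16pm".
    private static final DateTimeFormatter FMT = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("h:mma")
            .toFormatter(Locale.ENGLISH);

    public static void main(String[] args) {
        List<String> lines = List.of(
                "T1,2020-01-16,11:16pm,start",
                "T2,2020-01-16,11:18pm,start",
                "T1,2020-01-16,11:20pm,end",
                "T2,2020-01-16,11:23pm,end");

        // Group the parsed times by id; relies on the start line coming before the end line.
        Map<String, List<LocalTime>> byId = lines.stream()
                .map(l -> l.split(","))
                .collect(Collectors.groupingBy(a -> a[0],
                        Collectors.mapping(a -> LocalTime.parse(a[2], FMT),
                                Collectors.toList())));

        // Same (end - start) / 2 computation as the convert function above.
        byId.forEach((id, times) -> {
            Duration half = Duration.between(times.get(0), times.get(1)).dividedBy(2);
            System.out.printf("%s %d:%02d%n", id, half.toMinutes(), half.toSecondsPart());
        });
    }
}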

Sonarlint prompting to "Refactor code so that stream pipeline is used"

I was using a stream-based approach to map-reduce my List<Map<String,String>> to a List<CustomObject>. The following code was used for the stream:
List<Map<String,String>> mailVariable = (List<Map<String, String>>) processVariables.get("MAIL_MAP");
1| List<CustomObject> detList = mailVariable
2|     .stream()
3|     .flatMap(getEntry)
4|     .filter(isEmpty)
5|     .reduce(new ArrayList<CustomObject>(), accumulateToCustomObject, combiner);
I was analyzing my code using SonarLint and got the following error on lines 2 and 3:
Refactor this code so that stream pipeline is used. squid:S3958
I am in fact using a stream and returning the value from the terminal operation as suggested here. Is there anything I'm doing wrong? Could anyone suggest the correct way to write this code?
// following are the functional interface impls used in the process
Function<Map<String, String>, Stream<Entry<String, String>>> getEntry = data -> data.entrySet().stream();

// note: the original used || here, which never rejects anything and can throw a
// NullPointerException on null values; && expresses the intended "non-blank" check
Predicate<Entry<String, String>> isEmpty = data -> data.getValue() != null
        && !data.getValue().isEmpty()
        && !data.getValue().equals(" ");

BinaryOperator<ArrayList<CustomObject>> combiner = (a, b) -> {
    ArrayList<CustomObject> acc = b;
    acc.addAll(a);
    return acc;
};

BiFunction<ArrayList<CustomObject>, Entry<String, String>, ArrayList<CustomObject>> accumulateToCustomObject = (finalList, eachset) -> {
    /* the reduction process happens here,
       building the CustomObject... */
    return finalList;
};
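As an aside (not from the original post): reduce(new ArrayList<>(), ...) mutates the identity value, which violates reduce's contract and breaks on parallel streams; collect is the idiomatic tool for mutable accumulation. A rough sketch of the same pipeline, with the CustomObject-building logic elided just as it is above:
List<CustomObject> detList = mailVariable.stream()
        .flatMap(getEntry)
        .filter(isEmpty)
        .collect(ArrayList::new,
                (list, entry) -> {
                    // build the CustomObject from the entry and add it, e.g.
                    // list.add(buildCustomObject(entry)); (hypothetical helper)
                },
                ArrayList::addAll);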
Update: I have found a workaround for this problem by splitting my map-reduce operation into a map and a collect operation, like so. That particular lint error is no longer showing up.
List<AlertEventLogDetTO> detList = mailVariable
        .stream()
        .flatMap(getEntry)
        .filter(isEmpty)
        .map(mapToObj)
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
Function<Entry<String, String>, AlertEventLogDetTO> mapToObj = eachSet -> {
    String tagString = null;
    String tagValue = eachSet.getValue();
    try {
        tagString = MapVariables.valueOf(eachSet.getKey()).getTag();
    } catch (Exception e) {
        tagString = eachSet.getKey();
    }
    if (eventTags.contains(tagString)) {
        AlertEventLogDetTO entity = new AlertEventLogDetTO();
        entity.setAeldAelId(alertEventLog.getAelId());
        entity.setAelTag(tagString);
        entity.setAelValue(tagValue);
        return entity;
    }
    return null;
};
