I have this class (in Java), which I want to use in Spark (1.6):
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

public class Aggregation {

    private Map<String, Integer> counts;

    public Aggregation() {
        counts = new HashMap<String, Integer>();
    }

    public Aggregation add(Aggregation ia) {
        String key = buildCountString(ia);
        addKey(key);
        return this;
    }

    private void addKey(String key, int cnt) {
        if (counts.containsKey(key)) {
            counts.put(key, counts.get(key) + cnt);
        } else {
            counts.put(key, cnt);
        }
    }

    private void addKey(String key) {
        addKey(key, 1);
    }

    public Aggregation merge(Aggregation agg) {
        for (Entry<String, Integer> e : agg.counts.entrySet()) {
            this.addKey(e.getKey(), e.getValue());
        }
        return this;
    }

    private String buildCountString(Aggregation rec) {
    ...
    }
}
When starting Spark I enabled Kryo and registered this class (in Scala):
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Aggregation]))
And I want to use it with Spark aggregate like this (Scala):
rdd.aggregate(new Aggregation())((agg, rec) => agg.add(rec), (a, b) => a.merge(b))
Somehow this raises a "Task not serializable" exception.
But when I use the class with map and reduce, everything works fine:
val rdd2 = interactionObjects.map( _ => new Aggregation())
rdd2.reduce((a,b) => a.merge(b))
println(rdd2.count())
Do you have an idea why the error occurs with aggregate but not with map/reduce?
Thanks and regards!
Your Aggregation class should implement Serializable. When you call aggregate, the driver has to ship your zero value (new Aggregation()) to the workers as part of the task closure, and closures are serialized with Java serialization regardless of the Kryo settings, which is what raises the "Task not serializable" error. In the map/reduce version the Aggregation instances are created on the executors themselves, so nothing needs to be Java-serialized from the driver.
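A minimal sketch of the change (assuming the rest of the class stays exactly as above):
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class Aggregation implements Serializable {
    private static final long serialVersionUID = 1L;

    private Map<String, Integer> counts = new HashMap<String, Integer>();

    // add(...), addKey(...), merge(...) and buildCountString(...) stay unchanged
}
Registering the class with Kryo is still worthwhile for the records flowing through the RDD, but it does not cover objects captured in task closures.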
So far, I've been able to create a KStream from a topic.
KStream<String, Object> testqa2 = builder.stream("testqa2", Consumed.with(Serdes.String(), Serdes.String()))
.mapValues(value -> {
System.out.println(value);
return value;
});
It doesn't print anything, so on debugging I realized I am just creating my KStream; there is no data in it.
I am having a little trouble creating a serializer/deserializer for the Worker class.
package com.copart.mwa.Avro;
public class Worker {

    private String WorkerActivityName;
    private String WorkerSid;
    private String WorkerPreviousActivityName;
    private String WorkerPreviousActivitySid;

    public String getWorkerActivityName() {
        return WorkerActivityName;
    }
    public void setWorkerActivityName(String workerActivityName) {
        WorkerActivityName = workerActivityName;
    }
    public String getWorkerSid() {
        return WorkerSid;
    }
    public void setWorkerSid(String workerSid) {
        WorkerSid = workerSid;
    }
    public String getWorkerPreviousActivityName() {
        return WorkerPreviousActivityName;
    }
    public void setWorkerPreviousActivityName(String workerPreviousActivityName) {
        WorkerPreviousActivityName = workerPreviousActivityName;
    }
    public String getWorkerPreviousActivitySid() {
        return WorkerPreviousActivitySid;
    }
    public void setWorkerPreviousActivitySid(String workerPreviousActivitySid) {
        WorkerPreviousActivitySid = workerPreviousActivitySid;
    }

    @Override
    public String toString() {
        return "Worker(" + WorkerSid + ", " + WorkerActivityName + ")";
    }
}
And the message from the producer to the consumer is a JSON
{
"WorkerActivityName": "Available",
"EventType": "worker.activity.update",
"ResourceType": "worker",
"WorkerTimeInPreviousActivityMs": "237",
"Timestamp": "1626114642",
"WorkerActivitySid": "WAc9030ef021bc1786d3ae11544f4d9883",
"WorkerPreviousActivitySid": "WAf4feb231e97c1878fecc58b26fdb95f3",
"WorkerTimeInPreviousActivity": "0",
"AccountSid": "AC8c5cd8c9ba538090da104b26d68a12ec",
"WorkerName": "Dorothy.Finegan#Copart.Com",
"Sid": "EV284c8a8bc27480e40865263f0b42e5cf",
"TimestampMs": "1626114642204",
"P": "WKe638256376188fab2a98cccb3c803d67",
"WorkspaceSid": "WS38b10d521442ecb74fcc263d5a4d726e",
"WorkspaceName": "Copart-MiPhone",
"WorkerPreviousActivityName": "Unavailable(RNA)",
"EventDescription": "Worker Dorothy.Finegan#Copart.Com updated to Available Activity",
"ResourceSid": "WKe638256376188fab2a98cccb3c803d67",
"WorkerAttributes": "{\"miphone_dept\":[\"USA_YRD_OPS\"],\"languages\":[\"en\"],\"home_region\":\"GL\",\"roles\":[\"supervisor\"],\"miphone_yards\":[\"81\"],\"miphone_enabled\":true,\"miphone_states\":[\"IL\"],\"home_state\":\"IL\",\"skills\":[\"YD_SELLER\",\"YD_TITLE\"],\"home_division\":\"Northern\",\"miphone_divisions\":[\"Northern\"],\"miphone_functions\":[\"outbound_only\"],\"full_name\":\"Dorothy Finegan\",\"miphone_regions\":[\"GL\"],\"home_country\":\"USA\",\"copart_user_id\":\"USA3204\",\"home_yard\":\"81\",\"home_dept\":\"USA_YRD_OPS\",\"email\":\"dorothy.finegan#copart.com\",\"home_dept_category\":\"OPS\",\"contact_uri\":\"client:Dorothy_2EFinegan_40Copart_2ECom\",\"queue_activity\":\"Available\",\"teams\":[],\"remote_employee\":false,\"miphone_call_center_units\":[\"USA_YRD_OPS|81\"],\"miphone_call_center_teams\":[]}"
}
I want to implement a custom deserializer where
"WorkspaceSid": "WS38b10d521442ecb74fcc263d5a4d726e" is the key and the remaining attributes act as the value of the key-value pair.
Thanks,
Anmol
It doesn't print anything
If there is data in the testqa2 topic and you have auto.offset.reset=earliest, then it should.
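For a Kafka Streams app, that consumer setting goes into the streams configuration, for example (a sketch; the application id and bootstrap servers are placeholders):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "testqa2-app");        // placeholder app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
// Start from the beginning of the topic when there is no committed offset yet
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");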
having a little trouble creating serializer/deserializer for the Worker class
Kafka has built-in JSON serializers that you can build a Serde for. You don't need to make your own.
"WorkspaceSid", is the key
Use selectKey (or map) to modify the key, not mapValues.
Serializer<JsonNode> jsonNodeSerializer = new JsonSerializer();
Deserializer<JsonNode> jsonNodeDeserializer = new JsonDeserializer();
final Serde<JsonNode> jsonNodeSerde = Serdes.serdeFrom(jsonNodeSerializer, jsonNodeDeserializer);

KStream<String, JsonNode> testqa2 = builder.stream("testqa2", Consumed.with(Serdes.String(), jsonNodeSerde))
        .selectKey((k, json) -> json.get("WorkspaceSid").asText());
testqa2.print(Printed.toSysOut());
Alternatively, fix your producer code to get the Sid from the value, and set the key there...
If you want to use Avro, you wouldn't write a Worker class - you would generate it from an Avro schema.
I tried looking at cache mechanisms, such as Guava's Cache, but their expiration is measured only from the last update.
What I'm looking for is a data structure that stores keys and cleans the keys after a time has passed since the first insertion. I'm planning for the value to be some counter.
A scenario might be a silent worker that does some work the first time but then stays silent for an expiry period, even if work is requested. If work is requested after the expiry time has passed, it will do the work again.
Know of such a data structure? Thanks.
There are a few options for this.
Passive Removal
If it is not a requirement to clean up expired keys as soon as they expire or at set intervals, then PassiveExpiringMap from Apache Commons Collections is a good option. When a key in this map is accessed, its Time to Live (TTL; all keys share the same TTL) is checked, and if the key has expired it is removed from the map and null is returned. This data structure has no active clean-up mechanism, so expired entries are removed only when they are accessed after their TTL has elapsed.
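A minimal sketch with PassiveExpiringMap (from commons-collections4; the 60-second TTL and the key/value types are just examples):
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.commons.collections4.map.PassiveExpiringMap;

// Every key lives for 60 seconds from the moment it is put into the map
Map<String, Integer> counters =
        new PassiveExpiringMap<String, Integer>(TimeUnit.SECONDS.toMillis(60));

counters.put("worker-1", 1);                  // insert a counter
Integer current = counters.get("worker-1");   // becomes null once the TTL has elapsed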
Cache
If more cache-based functionality is needed (such as a maximum cache capacity and add/remove listening), Google Guava provides the CacheBuilder class. This class is more complex than the Apache Commons alternative, but it also provides much more functionality. The trade-off may be worth it if this is intended for more of a cache-based application.
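For example, a sketch with Guava (expireAfterWrite counts from insertion or replacement of an entry, not from reads):
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

Cache<String, Integer> counters = CacheBuilder.newBuilder()
        .expireAfterWrite(60, TimeUnit.SECONDS) // TTL measured from the last write
        .maximumSize(10_000)                    // optional capacity bound
        .build();

counters.put("worker-1", 1);
Integer current = counters.getIfPresent("worker-1"); // null after expiry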
Threaded Removal
If active removal of expired keys is needed, a separate thread can be spawned that is responsible for removing expired keys. Before looking at a possible implementation, note that this approach may be less performant than the above alternatives. Besides the overhead of starting a thread, the thread may contend with clients accessing the map. For example, if a client wants to access a key while the clean-up thread is removing expired keys, the client will either block (if synchronization is used) or see a different view of the map (which key-value pairs it contains) if some concurrent mechanism is employed.
That said, this approach is more involved because the TTL must be stored with the key. One approach is to create an ExpiringKey, such as the following (each key can then carry its own TTL, and even if all keys end up with the same TTL value, this removes the need for a Map decorator or another implementation of the Map interface):
public class ExpiringKey<T> {
private final T key;
private final long expirationTimestamp;
public ExpiringKey(T key, long ttlInMillis) {
this.key = key;
expirationTimestamp = System.currentTimeMillis() + ttlInMillis;
}
public T getKey() {
return key;
}
public boolean isExpired() {
return System.currentTimeMillis() > expirationTimestamp;
}
}
Now the type of the map would be Map<ExpiringKey<K>, V> with some specific K and V type values. The background thread can be represented using a Runnable that resembles the following:
public class ExpiredKeyRemover implements Runnable {
private final Map<ExpiringKey<?>, ?> map;
public ExpiredKeyRemover(Map<ExpiringKey<?>, ?> map) {
this.map = map;
}
@Override
public void run() {
Iterator<ExpiringKey<?>> it = map.keySet().iterator();
while (it.hasNext()) {
if (it.next().isExpired()) {
it.remove();
}
}
}
}
Then the Runnable can be started so that it executes at a fixed interval using a ScheduledExecutorService as follows (which will clean up the map every 5 seconds):
Map<ExpiringKey<K>, V> myMap = // ...
ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
executor.scheduleAtFixedRate(new ExpiredKeyRemover(myMap), 0, 5, TimeUnit.SECONDS);
It is important to note that the Map implementation used for myMap must be synchronized or allow concurrent access. The challenge with a concurrent Map implementation is that the ExpiredKeyRemover may see a different view of the map than a client, and an expired key may be returned to the client if the clean-up thread has not finished removing other keys (even if it has removed the desired/expired key, since its changes may not be visible yet). Additionally, the key-removal code above could be written with streams; the loop is used just to illustrate the logic rather than to provide the most performant implementation.
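For example, the map could be created like this (one simple option; a ConcurrentHashMap would also work, with the visibility caveats described above):
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

Map<ExpiringKey<String>, Integer> myMap = Collections.synchronizedMap(new HashMap<ExpiringKey<String>, Integer>());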
Hope that helps.
ExpiringMap
You can use ExpiringMap. It removes entries from the map after a time that is specified when the map is built. Here is the syntax:
public static Map<String, Long> threatURLCacheMap = ExpiringMap.builder().expiration(5, TimeUnit.MINUTES).build();
This will create a Map in which each element expires 5 minutes after insertion.
Add the net.jodah.expiringmap dependency to your Maven project.
Here is a link to learn more about it:
https://crunchify.com/how-to-use-expiringmap-maven-java-utility-to-remove-expired-objects-from-map-automatically-complete-java-tutorial/
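A slightly fuller sketch (class names as I recall them from the net.jodah.expiringmap library; the CREATED policy makes the timer run from insertion rather than from the last access):
import java.util.Map;
import java.util.concurrent.TimeUnit;
import net.jodah.expiringmap.ExpirationPolicy;
import net.jodah.expiringmap.ExpiringMap;

Map<String, Long> counters = ExpiringMap.builder()
        .expirationPolicy(ExpirationPolicy.CREATED) // measure TTL from insertion, not last access
        .expiration(5, TimeUnit.MINUTES)
        .build();

counters.put("worker-1", 1L); // disappears from the map 5 minutes after this put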
I created a data structure and called it DuplicateActionFilterByInsertTime.
The underlying notion is filtering duplicate messages: the following class filters actions for some period (filterMillis), measured from insert time.
Implementation:
public class DuplicateActionFilterByInsertTime<E extends Runnable> {
private static final Logger LOGGER = Logger.getLogger(DuplicateActionFilterByInsertTime.class.getName());
private final long filterMillis;
private final ConcurrentHashMap<E, SilenceInfoImpl> actionMap = new ConcurrentHashMap<>();
private final ConcurrentLinkedQueue<E> actionQueue = new ConcurrentLinkedQueue<>();
private final ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
private final AtomicBoolean purgerRegistered = new AtomicBoolean(false);
private final Set<Listener<E>> listeners = ConcurrentHashMap.newKeySet();
public DuplicateActionFilterByInsertTime(int filterMillis) {
this.filterMillis = filterMillis;
}
public SilenceInfo get(E e) {
SilenceInfoImpl insertionData = actionMap.get(e);
if (insertionData == null || insertionData.isExpired(filterMillis)) {
return null;
}
return insertionData;
}
public boolean run(E e) {
actionMap.computeIfPresent(e, (e1, insertionData) -> {
int count = insertionData.incrementAndGet();
if (count == 2) {
notifyFilteringStarted(e1);
}
return insertionData;
});
boolean isNew = actionMap.computeIfAbsent(e, e1 -> {
SilenceInfoImpl insertionData = new SilenceInfoImpl();
actionQueue.add(e1);
return insertionData;
}).getCount() == 1;
tryRegisterPurger();
if (isNew) {
e.run();
}
return isNew;
}
private void tryRegisterPurger() {
if (actionMap.size() != 0 && purgerRegistered.compareAndSet(false, true)) {
scheduledExecutorService.schedule(() -> {
try {
for (Iterator<E> iterator = actionQueue.iterator(); iterator.hasNext(); ) {
E e = iterator.next();
SilenceInfoImpl insertionData = actionMap.get(e);
if (insertionData == null || insertionData.isExpired(filterMillis)) {
iterator.remove();
}
if (insertionData != null && insertionData.isExpired(filterMillis)) {
SilenceInfoImpl removed = actionMap.remove(e);
FilteredItem<E> filteredItem = new FilteredItem<>(e, removed);
notifySilenceFinished(filteredItem);
} else {
// All the elements that were left shouldn't be purged.
break;
}
}
} finally {
purgerRegistered.set(false);
tryRegisterPurger();
}
}, filterMillis, TimeUnit.MILLISECONDS);
}
}
private void notifySilenceFinished(FilteredItem<E> filteredItem) {
new Thread(() -> listeners.forEach(l -> {
try {
l.onFilteringFinished(filteredItem);
} catch (Exception e) {
LOGGER.log(Level.WARNING, "Purge notification failed. Continuing to next one (if exists)", e);
}
})).start();
}
private void notifyFilteringStarted(final E e) {
new Thread(() -> listeners.forEach(l -> {
try {
l.onFilteringStarted(e);
} catch (Exception e1) {
LOGGER.log(Level.WARNING, "Silence started notification failed. Continuing to next one (if exists)", e1);
}
})).start();
}
public void addListener(Listener<E> listener) {
listeners.add(listener);
}
public void removeLister(Listener<E> listener) {
listeners.remove(listener);
}
public interface SilenceInfo {
long getInsertTimeMillis();
int getCount();
}
public interface Listener<E> {
void onFilteringStarted(E e);
void onFilteringFinished(FilteredItem<E> filteredItem);
}
private static class SilenceInfoImpl implements SilenceInfo {
private final long insertTimeMillis = System.currentTimeMillis();
private AtomicInteger count = new AtomicInteger(1);
int incrementAndGet() {
return count.incrementAndGet();
}
@Override
public long getInsertTimeMillis() {
return insertTimeMillis;
}
@Override
public int getCount() {
return count.get();
}
boolean isExpired(long expirationMillis) {
return insertTimeMillis + expirationMillis < System.currentTimeMillis();
}
}
public static class FilteredItem<E> {
private final E item;
private final SilenceInfo silenceInfo;
FilteredItem(E item, SilenceInfo silenceInfo) {
this.item = item;
this.silenceInfo = silenceInfo;
}
public E getItem() {
return item;
}
public SilenceInfo getSilenceInfo() {
return silenceInfo;
}
}
}
Test example: (More tests here)
@Test
public void testSimple() throws InterruptedException {
int filterMillis = 100;
DuplicateActionFilterByInsertTime<Runnable> expSet = new DuplicateActionFilterByInsertTime<>(filterMillis);
AtomicInteger purgeCount = new AtomicInteger(0);
expSet.addListener(new DuplicateActionFilterByInsertTime.Listener<Runnable>() {
@Override
public void onFilteringFinished(DuplicateActionFilterByInsertTime.FilteredItem<Runnable> filteredItem) {
purgeCount.incrementAndGet();
}
@Override
public void onFilteringStarted(Runnable runnable) {
}
});
Runnable key = () -> {
};
long beforeAddMillis = System.currentTimeMillis();
boolean added = expSet.run(key);
long afterAddMillis = System.currentTimeMillis();
Assert.assertTrue(added);
DuplicateActionFilterByInsertTime.SilenceInfo silenceInfo = expSet.get(key);
Assertions.assertThat(silenceInfo.getInsertTimeMillis()).isBetween(beforeAddMillis, afterAddMillis);
expSet.run(key);
DuplicateActionFilterByInsertTime.SilenceInfo silenceInfo2 = expSet.get(key);
Assert.assertEquals(silenceInfo.getInsertTimeMillis(), silenceInfo2.getInsertTimeMillis());
Assert.assertFalse(silenceInfo.getInsertTimeMillis() + filterMillis < System.currentTimeMillis());
Assert.assertEquals(silenceInfo.getCount(), 2);
Thread.sleep(filterMillis);
Assertions.assertThat(expSet.get(key)).isNull();
Assert.assertNull(expSet.get(key));
Thread.sleep(filterMillis * 2); // Give a chance to purge the items.
Assert.assertEquals(1, purgeCount.get());
System.out.println("Finished");
}
My API needs to read a large recordset and transform it into a hierarchy (JSON) so that the UI (Angular) can display it appropriately. I am looking for an efficient way to achieve this transformation (for thousands of records).
Which Collection type is best suited? Are there any preferred mappers?
Details:
public class Batch implements Serializable {
private Timestamp deliveryDateTime;
private String deliveryLocation;
private String patientName;
// other batch details
}
I have a list of batches (Collection&lt;Batch&gt;). When I return this collection to the UI, it needs to be sorted first by deliveryDateTime, then by deliveryLocation, and then by patientName.
The resulting JSON will look like:
{
"deliveryDateTimes": [
{
"deliveryDateTime": "Mon, 20-Nov-2017",
"deliveryLocations": [
{
"deliveryLocation": "location1",
"patients": [
{
"patientName": "LastName1, FirstName1",
"batches": [
{
"otherBatchDetails": "other batch details"
// other batch details.
},
{
"otherBatchDetails": "other batch details"
// other batch details.
}
]
}
]
}
]
}
]
}
You can try this one. I have tried it and it works fine for me.
public class BatchTest {

    public static void main(String[] args) {
        List<Batch> sortedList = generateBatches().stream()
                .sorted(Comparator.comparing(Batch::getDeliveryDateTime).reversed()
                        .thenComparing(Comparator.comparing(Batch::getDeliveryLocation))
                        .thenComparing(Comparator.comparing(Batch::getPatientName)))
                .collect(Collectors.toList());

        // LinkedHashMap keeps the groups in the order of the sorted stream;
        // the default groupingBy would return an unordered HashMap.
        Map<Date, Map<String, Map<String, List<Batch>>>> result = sortedList.stream()
                .collect(Collectors.groupingBy(Batch::getDeliveryDateTime, LinkedHashMap::new,
                        Collectors.groupingBy(Batch::getDeliveryLocation, LinkedHashMap::new,
                                Collectors.groupingBy(Batch::getPatientName, LinkedHashMap::new,
                                        Collectors.toList()))));

        System.out.println("Batches : " + result);
    }

    private static List<Batch> generateBatches() {
        //DB call to fetch list of objects
    }
}
A TreeSet collection can be used in this context. A comparator for the TreeSet can be written as follows:
class BatchSorter implements Comparator<Batch>{
@Override
public int compare(Batch b1, Batch b2) {
if(b1.getDeliveryDateTime().after(b2.getDeliveryDateTime())){
return 1;
}
else if(b1.getDeliveryDateTime().before(b2.getDeliveryDateTime())){
return -1;
}
else{ // if 2 dates are equal
if(b1.getDeliveryLocation().compareTo(b2.getDeliveryLocation())>0){
return 1;
}
else if(b1.getDeliveryLocation().compareTo(b2.getDeliveryLocation())<0){
return -1;
}
else{
return(b1.getPatientName().compareTo(b2.getPatientName())); // If location names are equal
}
}
}
}
This can be used in TreeSet as follows:
TreeSet<Batch> ts = new TreeSet<Batch>(new BatchSorter());
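Usage might then look like the following sketch (batches stands for the Collection&lt;Batch&gt; from the question). Note that a TreeSet treats elements that compare as equal as duplicates, so two batches with the same date, location and patient name would collapse into one entry; keep that in mind if that can happen in your data.
TreeSet<Batch> ts = new TreeSet<Batch>(new BatchSorter());
ts.addAll(batches); // batches is the Collection<Batch> to be displayed

for (Batch b : ts) {
    // iterate in deliveryDateTime / deliveryLocation / patientName order
    // and build the nested JSON structure from here
}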
I have two ListenableFutures which are completed on other threads. Each future is of a different type, and I wish to use both of their results when they are both complete.
Is there an elegant way to handle this using Guava?
If you want some sort of type safety you can do the following:
class Composite {
    public final A a;
    public final B b;

    public Composite(A a, B b) {
        this.a = a;
        this.b = b;
    }
}

public ListenableFuture<Composite> combine(ListenableFuture<A> futureA,
                                           final ListenableFuture<B> futureB) {
    // Note: on recent Guava versions the AsyncFunction overload is Futures.transformAsync
    return Futures.transform(futureA, new AsyncFunction<A, Composite>() {
        public ListenableFuture<Composite> apply(final A a) throws Exception {
            return Futures.transform(futureB, new Function<B, Composite>() {
                public Composite apply(B b) {
                    return new Composite(a, b);
                }
            });
        }
    });
}
ListenableFuture<A> futureA = ...
ListenableFuture<B> futureB = ...
ListenableFuture<Composite> result = combine(futureA, futureB);
In this case Composite can be a Pair<A, B> from Apache Commons if you like.
Also, a failure in either future will result in a failure in the resulting combined future.
Another solution would be to take a look at Trickle from the team at Spotify. The GitHub README has an example which shows a solution to a similar problem.
There are undoubtedly other solutions but this is the one that popped into my head.
Runnable listener = new Runnable() {
private boolean jobDone = false;
@Override
public synchronized void run() {
if (jobDone || !(future1.isDone() && future2.isDone())) {
return;
}
jobDone = true;
// TODO do your job
}
};
// addListener requires an Executor; directExecutor() runs the listener on the completing thread
future1.addListener(listener, MoreExecutors.directExecutor());
future2.addListener(listener, MoreExecutors.directExecutor());
Not really elegant, but should do the job.
Or, more elegant, but you'll need casts:
ListenableFuture<List<Object>> composedFuture =
Futures.allAsList(future1, future2);
Since Guava v20.0 you can use:
ListenableFuture<CombinedResult> resultFuture =
Futures.whenAllSucceed(future1, future2)
.call(callableThatCombinesAndReturnsCombinedResult, executor);
See the example in the Javadoc for Futures.whenAllSucceed.
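A concrete sketch of that pattern (the Integer/Double futures and the String result are placeholders for your own types; Futures.getDone is safe inside the callable because it only runs once both inputs have succeeded):
import java.util.concurrent.Executor;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

ListenableFuture<Integer> future1 = Futures.immediateFuture(1);   // stand-ins for real async calls
ListenableFuture<Double>  future2 = Futures.immediateFuture(2.0);
Executor executor = MoreExecutors.directExecutor();

ListenableFuture<String> combined =
        Futures.whenAllSucceed(future1, future2)
               .call(() -> {
                   Integer a = Futures.getDone(future1);
                   Double b = Futures.getDone(future2);
                   return a + " / " + b;
               }, executor);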
If you want some type safety, you can combine the results of 2 different independent tasks by using EventBus from Guava's com.google.common.eventbus package.
For the example's sake, let's assume that one of your Futures returns Integer and the other Double.
First, create an accumulator (other names: builder, collector, etc.) class that you will register as an event sink with EventBus. As you can see, it's really a POJO which will handle Integer and Double events.
class Accumulator
{
Integer intResult;
Double doubleResult;
@Subscribe // This annotation makes it an event handler
public void setIntResult ( final Integer val )
{
intResult = val;
}
@Subscribe
public void setDoubleResult ( final Double val )
{
doubleResult = val;
}
}
Here is the code that takes the 2 futures and combines their results into an accumulator.
final ListenableFuture< Integer > future1 = ...;
final ListenableFuture< Double > future2 = ...;
final ImmutableList< ListenableFuture< ? extends Object> > futures =
ImmutableList.< ListenableFuture<? extends Object> >of( future1, future2 );
final ListenableFuture< Accumulator > resultFuture =
Futures.transform(
// If you don't care about failures, use allAsList
Futures.successfulAsList( futures ),
new Function< List<Object>, Accumulator > ( )
{
@Override
public Accumulator apply ( final List< Object > input )
{
final Accumulator accumulator = new Accumulator( );
final EventBus eventBus = new EventBus( );
eventBus.register( accumulator );
for ( final Object cur: input )
{
// Failed results will be set to null
if ( cur != null )
{
eventBus.post( cur );
}
}
return accumulator;
}
}
);
final Accumulator accumulator = resultFuture.get( );
Here is a simple example that would perform the addition of 2 listenable futures:
//Asynchronous call to get first value
final ListenableFuture<Integer> futureValue1 = ...;
//Take the result of futureValue1 and transform it into a function to get the second value
final AsyncFunction<Integer, Integer> getSecondValueAndSumFunction = new AsyncFunction<Integer, Integer>() {
@Override
public ListenableFuture<Integer> apply(final Integer value1) {
//Asynchronous call to get second value
final ListenableFuture<Integer> futureValue2 = ...;
//Return the sum of the values
final Function<Integer, Integer> addValuesFuture = new Function<Integer, Integer>() {
@Override
public Integer apply(Integer value2) {
Integer sum = value1 + value2;
return sum;
}
};
//Transform the second value so its value can be added to the first
final ListenableFuture<Integer> sumFuture = Futures.transform(futureValue2, addValuesFuture);
return sumFuture;
}
};
final ListenableFuture<Integer> valueOnePlusValueTwo = Futures.transform(futureValue1, getSecondValueAndSumFunction);
I have a class, the outline of which is basically listed below.
import org.apache.commons.math.stat.Frequency;
public class WebUsageLog {
private Collection<LogLine> logLines;
private Collection<Date> dates;
WebUsageLog() {
this.logLines = new ArrayList<LogLine>();
this.dates = new ArrayList<Date>();
}
SortedMap<Double, String> getFrequencyOfVisitedSites() {
SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
Collection<String> domains = new HashSet<String>();
Frequency freq = new Frequency();
for (LogLine line : this.logLines) {
freq.addValue(line.getVisitedDomain());
domains.add(line.getVisitedDomain());
}
for (String domain : domains) {
frequencyMap.put(freq.getPct(domain), domain);
}
return frequencyMap;
}
}
The intention of this application is to allow our Human Resources folks to be able to view Web Usage Logs we send to them. However, I'm sure that over time, I'd like to be able to offer the option to view not only the frequency of visited sites, but also other members of LogLine (things like the frequency of assigned categories, accessed types [text/html, img/jpeg, etc...] filter verdicts, and so on). Ideally, I'd like to avoid writing individual methods for compilation of data for each of those types, and they could each end up looking nearly identical to the getFrequencyOfVisitedSites() method.
So, my question is twofold: first, can you see anywhere where this method should be improved, from a mechanical standpoint? And secondly, how would you make this method more generic, so that it might be able to handle an arbitrary set of data?
This is basically the same thing as Eugene's solution, I just left all the frequency calculation stuff in the original method and use the strategy only for getting the field to work on.
If you don't like enums you could certainly do this with an interface instead.
public class WebUsageLog {
private Collection<LogLine> logLines;
private Collection<Date> dates;
WebUsageLog() {
this.logLines = new ArrayList<LogLine>();
this.dates = new ArrayList<Date>();
}
SortedMap<Double, String> getFrequency(LineProperty property) {
SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
Collection<String> values = new HashSet<String>();
Frequency freq = new Frequency();
for (LogLine line : this.logLines) {
freq.addValue(property.getValue(line));
values.add(property.getValue(line));
}
for (String value : values) {
frequencyMap.put(freq.getPct(value), value);
}
return frequencyMap;
}
public enum LineProperty {
VISITED_DOMAIN {
@Override
public String getValue(LogLine line) {
return line.getVisitedDomain();
}
},
CATEGORY {
@Override
public String getValue(LogLine line) {
return line.getCategory();
}
},
VERDICT {
@Override
public String getValue(LogLine line) {
return line.getVerdict();
}
};
public abstract String getValue(LogLine line);
}
}
Then given an instance of WebUsageLog you could call it like this:
WebUsageLog usageLog = ...
SortedMap<Double, String> visitedSiteFrequency = usageLog.getFrequency(VISITED_DOMAIN);
SortedMap<Double, String> categoryFrequency = usageLog.getFrequency(CATEGORY);
I'd introduce an abstraction like "data processor" for each computation type, so you can just call individual processors for each line:
...
void process(Collection<Processor> processors) {
for (LogLine line : this.logLines) {
for (Processor processor : processors) {
processor.process(line);
}
}
for (Processor processor : processors) {
processor.complete();
}
}
...
public interface Processor {
public void process(LogLine line);
public void complete();
}
public class FrequencyProcessor implements Processor {
SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
Collection<String> domains = new HashSet<String>();
Frequency freq = new Frequency();
public void process(LogLine line) {
String property = getProperty(line);
freq.addValue(property);
domains.add(property);
}
protected String getProperty(LogLine line) {
return line.getVisitedDomain();
}
public void complete() {
for (String domain : domains) {
frequencyMap.put(freq.getPct(domain), domain);
}
}
}
You could also change the LogLine API to be more like a Map, i.e. instead of the strongly typed line.getVisitedDomain() you could use line.get("VisitedDomain"); then you can write a generic FrequencyProcessor for all properties and just pass a property name in its constructor.
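A sketch of that variant (it reuses the Processor interface above, assumes the same imports as the question's WebUsageLog class, and assumes LogLine gains the map-style get(String) accessor described here):
public class FrequencyProcessor implements Processor {
    private final String propertyName;
    private final SortedMap<Double, String> frequencyMap =
            new TreeMap<Double, String>(Collections.reverseOrder());
    private final Collection<String> values = new HashSet<String>();
    private final Frequency freq = new Frequency();

    public FrequencyProcessor(String propertyName) {
        this.propertyName = propertyName;
    }

    public void process(LogLine line) {
        String value = line.get(propertyName); // hypothetical map-style accessor on LogLine
        freq.addValue(value);
        values.add(value);
    }

    public void complete() {
        for (String value : values) {
            frequencyMap.put(freq.getPct(value), value);
        }
    }

    public SortedMap<Double, String> getFrequencyMap() {
        return frequencyMap;
    }
}

// usage: one processor instance per property name
process(Arrays.<Processor>asList(new FrequencyProcessor("VisitedDomain"), new FrequencyProcessor("Category")));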