Pass a map (or ConcurrentHashMap) in a DoFn (Apache Crunch) - java

Since there is a limit on the number of Hadoop counters (and we don't want to raise it for just one job), I am creating a Map whose value for a key is incremented when certain conditions are met (same idea as counters). There is already a DoFn (returning a custom object) that processes the data, so I would like to pass a map into it and group the results outside based on keys.
I think a ConcurrentHashMap might work, but I have been unable to implement it.
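As a minimal, framework-free sketch of the counting part (the class and method names here are illustrative, not Crunch API), a thread-safe counter map can be built with ConcurrentHashMap.merge, which makes each increment atomic:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CounterMap {
    // Shared counter map; merge() performs the read-add-write atomically.
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    public void increment(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    // Immutable view of the current counts.
    public Map<String, Long> snapshot() {
        return Map.copyOf(counts);
    }
}
```

One caveat: in a distributed job each task runs in its own JVM, so an in-process map like this is only shared among threads of a single task. Aggregating across tasks still requires emitting the keys and grouping them, as the question's "grouping it outside based on keys" suggests.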

Related

Pruning flink state based on attributes of stored object

Consider the following example class, instances of which are being stored in a ListState:
class BusinessObject {
    long clientId;
    String region;
    Instant lastDealDate;
    boolean isActive;
}
The application requires that this object should not be in the Flink state if it has been 1 year since the last deal was made with a particular client (lastDealDate) and the client is not active, i.e. isActive == false.
What would be the proper way of going about this and letting Flink know of these 2 factors so it removes those entries automatically? Currently I read all the items in the state, clear the state, and then add back the relevant ones; however, this will start to take a long time as the number of clients increases and the size of the state grows. Most of my searches online talk about using time-to-live, set on the state via its descriptor. However, my logic can't rely on processing/event/ingestion time, and I also need to check whether isActive is false.
Extra info: the context is not keyed and the backend is RocksDB. The reason a ListState is used is that all of the relevant state/history matching the above conditions needs to be dumped every day.
Any suggestions?
With the RocksDB state backend, Flink can append to ListState without going through serialization/deserialization, but any read or modification other than an append is expensive because of ser/de.
You'll be better off if you can rework things so that these BusinessObjects are stored in MapState, even if you occasionally have to iterate over the entire map. Each key/value pair in the MapState will be a separate RocksDB entry, and you'll be able to individually create/update/delete them without having to go through ser/de for the entire map (unless you do have to scan it). (For what it's worth, iterating over MapState in RocksDB proceeds through the map in serialized-key-sorted order.)
MapState is only available as keyed (or broadcast) state, so this change would require you to key the stream. Using keyBy does force a network shuffle (and ser/de), so it will be expensive, but not as expensive as using ListState.
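Whichever state primitive ends up holding the objects, the eviction rule itself is plain Java. A sketch of the predicate, using the class and fields from the question (the Pruner wrapper class is illustrative), that a daily scan over MapState could apply to each entry:

```java
import java.time.Duration;
import java.time.Instant;

public class Pruner {
    // Mirrors the BusinessObject from the question.
    static class BusinessObject {
        long clientId;
        String region;
        Instant lastDealDate;
        boolean isActive;
    }

    // True when the entry should be dropped from state:
    // last deal more than a year ago AND the client is inactive.
    static boolean shouldEvict(BusinessObject bo, Instant now) {
        boolean stale = bo.lastDealDate.isBefore(now.minus(Duration.ofDays(365)));
        return stale && !bo.isActive;
    }
}
```

During the scan, entries for which shouldEvict returns true would be removed individually from the MapState, avoiding a full rewrite of the collection.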

Can I use LinkedHashMap in Hazelcast?

Can I somehow use a LinkedHashMap in Hazelcast (Java Spring)? I need to get unique records from the Hazelcast shared in-memory cache, but in the order in which I inserted them. The Hazelcast documentation (https://docs.hazelcast.org/docs/latest-dev/manual/html-single/) says it offers distributed implementations of common data structures, but a map doesn't preserve element order, and a list or queue doesn't remove duplicate data. Do you know whether I can use a LinkedHashMap, or somehow get unique data while preserving its order?
Ordered or linked storage isn't compatible with the goals of a data grid - highly concurrent and distributed storage.
Ordered retrieval is possible. Hazelcast's Paging Predicate with a comparator would do it. Or, if the volume is not too high, you could retrieve the entry set and sort it yourself.
The catch is, you have to provide the field to order upon.
If your data already has some sort of sequence number or timestamp that is always unique, this is easy.
If not, perhaps something like an Atomic Long would do it. A getAndIncrement() would give you a unique number to use for each insert.
Watch out though: this has a race condition if two or more threads insert concurrently. To solve it you'd need some sort of singleton @Service running somewhere to do the "get next seqno; insert" step.
And if you restart the grid, the seqno in the atomic counter will need to be repositioned to the right place.
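A local, single-JVM illustration of the idea (using java.util.concurrent's AtomicLong as a stand-in for Hazelcast's distributed counter; the class and method names are illustrative): tag each unique value with a sequence number on insert, then sort by that number to recover insertion order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class OrderedCache {
    private final AtomicLong seq = new AtomicLong();
    // value -> sequence number; computeIfAbsent keeps entries unique
    // and assigns a seqno only on first insertion.
    private final ConcurrentHashMap<String, Long> entries = new ConcurrentHashMap<>();

    public void insert(String value) {
        entries.computeIfAbsent(value, v -> seq.getAndIncrement());
    }

    // Unique values in insertion order, recovered by sorting on seqno.
    public List<String> inOrder() {
        List<Map.Entry<String, Long>> es = new ArrayList<>(entries.entrySet());
        es.sort((a, b) -> Long.compare(a.getValue(), b.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : es) out.add(e.getKey());
        return out;
    }
}
```

In a real grid the map would be a Hazelcast IMap and the counter a cluster-wide one, subject to the race condition and restart caveats described above.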

Is putting asynchronously different keys in an HashMap dangerous?

(Don't judge the design, be merciful)
I have a Map<String, String> that I need to populate with sub-maps from asynchronous calls. I am using map.putAll(dataMap) to insert each sub-map into the main map.
However, the asynchronous part is making me kind of nervous. I know that I won't attempt to insert the same key twice (that is a sure fact), but I don't know whether inserting data asynchronously will still run into concurrency issues.
Should I use a ConcurrentHashMap to be safe, or are there no risks with inserting into a classic HashMap asynchronously given that I know I won't insert the same key twice? Or is there a third class that I don't know of that would fit the job perfectly?
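Disjoint keys do not make a plain HashMap safe for concurrent writers, because resizing and bucket updates mutate shared internals. A sketch of the safe variant (the Merger class and mergeAll method are illustrative names) that merges sub-maps from several threads:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Merger {
    // ConcurrentHashMap tolerates concurrent putAll calls; a plain
    // HashMap would risk lost updates or a corrupted table here.
    public static Map<String, String> mergeAll(List<Map<String, String>> parts)
            throws InterruptedException {
        Map<String, String> result = new ConcurrentHashMap<>();
        List<Thread> threads = parts.stream()
                .map(p -> new Thread(() -> result.putAll(p)))
                .toList();
        threads.forEach(Thread::start);
        for (Thread t : threads) t.join();
        return result;
    }
}
```

If the sub-maps instead arrive as futures, collecting them on a single thread and calling putAll sequentially would also be safe with a plain HashMap; it is the concurrent writes that are the problem, not the keys.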

Is a ConcurrentHashSet required if threads are using different keys?

Suppose I have a hash set of request IDs that I've sent from a client to a server. The server's response returns the request ID that I sent, which I can then remove from the hash set. This will be run in a multithreaded fashion, so multiple threads can be adding to and removing IDs from the hash set. However, since the IDs generated are unique (from a thread safe source, let's say an AtomicInteger for now that gets updated for each new request), does the HashSet need to be a ConcurrentHashSet?
I would think the only case where this might cause a problem would be if the HashSet encounters collisions that require data-structure changes to the underlying HashSet object, but it doesn't seem like this would occur in this use case.
Yes. The underlying array for the hash table might need to be resized, for instance, and of course IDs can collide, so having different keys will not help at all.
However, since you know that the IDs are increasing, and if you can put an upper bound on the maximum number of IDs outstanding (let's say 1000), you can work with an upper and lower bound and a fixed-size array with offset indexing from the lowest key, in which case you will not need any mutexes or concurrent data structures. Such a data structure is very fragile, however, since if you ever have more outstanding IDs than your upper bound, all hell will break loose. So unless performance is a concern, just use the ConcurrentHashSet.
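For reference, the JDK has no ConcurrentHashSet class as such; the usual stand-in is a set view over a ConcurrentHashMap. A sketch of the request-tracking pattern from the question (the RequestTracker class and its method names are illustrative):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class RequestTracker {
    private final AtomicInteger nextId = new AtomicInteger();
    // Thread-safe set: safe for concurrent add/remove from many threads,
    // including during internal resizing.
    private final Set<Integer> pending = ConcurrentHashMap.newKeySet();

    // Generate a unique request ID and record it as outstanding.
    public int send() {
        int id = nextId.getAndIncrement();
        pending.add(id);
        return id;
    }

    // Remove the ID when its response arrives; true if it was pending.
    public boolean acknowledge(int id) {
        return pending.remove(id);
    }

    public int outstanding() {
        return pending.size();
    }
}
```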

Can Hibernate return a collection of result objects OTHER than a List?

Does the Hibernate API support object result sets in the form of a collection other than a List?
For example, I have a process that runs hundreds of thousands of iterations in order to create some data for a client. This process uses records from a Value table (for example) in order to create its output for each iteration.
With a List I would have to iterate through the entire list in order to find a certain value, which is expensive. I'd like to be able to return a TreeMap and specify a key programmatically so I can search the collection for the specific value I need. Can Hibernate do this for me?
I assume you are referring to the Query.list() method. If so: no, there is no way to return top-level results other than a List. If you are receiving too many results, why not issue a more constrained query to the database? If the query is difficult to constrain, you can populate your own Map with the contents of Hibernate's List and then throw away the list.
If I understand correctly, you load a bunch of data from the database to memory and then use them locally by looking for certain objects in that list.
If this is the case, I see 2 options.
Don't load all the data; instead, for each iteration hit the database with a query returning only the specific record that you need. This will issue more database queries, so it will probably be slower, but with much less memory consumption. This solution could easily be improved by adding a cache, so that the most-used values are retrieved quickly. It will of course need some performance measurement, but I usually favor a naive solution with good caching, as the cache can be implemented as a cross-cutting concern and be very transparent to the programmer.
If you really want to load all your data in memory (which is actually a form of caching), the time to transform your data from a list to a TreeMap (or any other efficient structure) will probably be small compared to the full processing. So you could do the data transformation yourself.
As I said, in the general case, I would favor a solution with caching ...
From Java Persistence with Hibernate:
A java.util.Map can be mapped with <map>, preserving key and value pairs. Use a java.util.HashMap to initialize a property.
A java.util.SortedMap can be mapped with the <map> element, and the sort attribute can be set to either a comparator or natural ordering for in-memory sorting. Initialize the collection with a java.util.TreeMap instance.
Yes, that can be done.
However, you'll probably have to have your domain class implement Comparable; I don't think you can do it using a Comparator.
Edit:
It seems like I misunderstood the question. If you're talking about the result of an ad hoc query, then the above will not help you. It might be possible to make it work by binding an object with a TreeMap property to a database view if the query is fixed.
And of course you can always build the map yourself with very little work and processing overhead.
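Building the map yourself is indeed only a few lines. A sketch, assuming a hypothetical Value entity with a key field (the names are illustrative, not Hibernate API), that indexes a query's List result for fast keyed lookups:

```java
import java.util.List;
import java.util.TreeMap;

public class ResultIndexer {
    // Hypothetical entity standing in for a Hibernate-mapped class.
    public record Value(String key, int amount) {}

    // Index the List returned by a query so lookups by key are
    // O(log n) instead of a linear scan per iteration.
    public static TreeMap<String, Value> index(List<Value> results) {
        TreeMap<String, Value> byKey = new TreeMap<>();
        for (Value v : results) {
            byKey.put(v.key(), v);
        }
        return byKey;
    }
}
```

If ordering is not actually needed and only lookup speed matters, a HashMap would serve the same purpose with O(1) lookups.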
