What are the other options to handle skewed data in Flink?

I am studying how to handle data skew in Flink and how to use the low-level controls for physical partitioning in order to get even processing of tuples. I have created synthetic skewed data sources and I aim to process (aggregate) them over a window. Here is the complete code.
streamTrainsStation01.union(streamTrainsStation02)
        .union(streamTicketsStation01).union(streamTicketsStation02)
        // map the keys
        .map(new StationPlatformMapper(metricMapper)).name(metricMapper)
        .rebalance() // or .rescale() or .shuffle()
        .keyBy(new StationPlatformKeySelector())
        .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
        .apply(new StationPlatformRichWindowFunction(metricWindowFunction)).name(metricWindowFunction)
        .setParallelism(4)
        .map(new StationPlatformMapper(metricSkewedMapper)).name(metricSkewedMapper)
        .addSink(new MqttStationPlatformPublisher(ipAddressSink, topic)).name(metricSinkFunction)
        ;
According to the Flink dashboard, I could not see much difference among .shuffle(), .rescale(), and .rebalance(), even though the documentation says the rebalance() transformation is the one more suitable for data skew.
After that I tried to use .partitionCustom(partitioner, "someKey"). However, to my surprise, I could not use setParallelism(4) on the window operation. The documentation says:
Note: This operation is inherently non-parallel since all elements
have to pass through the same operator instance.
I did not understand why. If I am allowed to do partitionCustom, why can't I use parallelism after that? Here is the complete code.
streamTrainsStation01.union(streamTrainsStation02)
        .union(streamTicketsStation01).union(streamTicketsStation02)
        // map the keys
        .map(new StationPlatformMapper(metricMapper)).name(metricMapper)
        .partitionCustom(new StationPlatformKeyCustomPartitioner(), new StationPlatformKeySelector())
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(20)))
        .apply(new StationPlatformRichAllWindowFunction(metricWindowFunction)).name(metricWindowFunction)
        .map(new StationPlatformMapper(metricSkewedMapper)).name(metricSkewedMapper)
        .addSink(new MqttStationPlatformPublisher(ipAddressSink, topic)).name(metricSinkFunction)
        ;
Thanks,
Felipe

I got an answer from the Flink user mailing list. Basically, using keyBy() after rebalance() undoes whatever rebalance() achieved, because the records are repartitioned by key again. The first (ad hoc) solution I found is to create a composite key that accounts for the skewed key.
public class CompositeSkewedKeyStationPlatform implements Serializable {
    private static final long serialVersionUID = -5960601544505897824L;
    private Integer stationId;
    private Integer platformId;
    private Integer skewParameter;

    public CompositeSkewedKeyStationPlatform(Integer stationId, Integer platformId, Integer skewParameter) {
        this.stationId = stationId;
        this.platformId = platformId;
        this.skewParameter = skewParameter;
    }
    // equals() and hashCode() over all three fields are required for keyBy()
}
I use it in a map function before calling keyBy().
public class StationPlatformSkewedKeyMapper
        extends RichMapFunction<MqttSensor, Tuple2<CompositeSkewedKeyStationPlatform, MqttSensor>> {
    private SkewParameterGenerator skewParameterGenerator;

    public StationPlatformSkewedKeyMapper() {
        this.skewParameterGenerator = new SkewParameterGenerator(10);
    }

    @Override
    public Tuple2<CompositeSkewedKeyStationPlatform, MqttSensor> map(MqttSensor value) throws Exception {
        Integer platformId = value.getKey().f2;
        Integer stationId = value.getKey().f4;
        Integer skewParameter = 0;
        if (stationId.equals(2) && platformId.equals(3)) {
            skewParameter = this.skewParameterGenerator.getNextItem();
        }
        CompositeSkewedKeyStationPlatform compositeKey =
                new CompositeSkewedKeyStationPlatform(stationId, platformId, skewParameter);
        return Tuple2.of(compositeKey, value);
    }
}
Here is my complete solution.
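For reference, keying by this composite key then requires a KeySelector over the mapper's output tuple. Below is a minimal sketch of such a selector; the class name CompositeSkewedKeySelector is my own, not from the original post, and note again that the composite key class must implement equals() and hashCode() for keyBy() to group correctly.

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;

public class CompositeSkewedKeySelector implements
        KeySelector<Tuple2<CompositeSkewedKeyStationPlatform, MqttSensor>, CompositeSkewedKeyStationPlatform> {
    @Override
    public CompositeSkewedKeyStationPlatform getKey(
            Tuple2<CompositeSkewedKeyStationPlatform, MqttSensor> value) {
        // f0 holds the composite (salted) key attached by StationPlatformSkewedKeyMapper.
        return value.f0;
    }
}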

Making an enum with stored LinkedHashMap values

I'm somewhat of a beginner to Java, although I understand the basics. I believe this is the best implementation for my problem, but I may be wrong. This is a mock example I made, and I'm not interested in looking for different implementations; I only mention my uncertainty in case what I want turns out to be impossible. Regardless:
Here I have an enum, inside of which I want a map (specifically a LinkedHashMap) as one of the values stored by each enum constant:
enum Recipe {
    PANCAKES(true, new LinkedHashMap<>()),
    SANDWICH(true, new LinkedHashMap<>()),
    STEW(false, new LinkedHashMap<>());

    private final boolean tasty;
    private final LinkedHashMap<String, String> directions;
    // getter for directions

    Recipe(boolean tasty, LinkedHashMap<String, String> directions) {
        this.tasty = tasty;
        this.directions = directions;
    }
}
However, I haven't found a way to initialize and populate a map of any size in a single expression (as would be needed for an enum constant).
For example, I thought this looked fine:
PANCAKES(true, new LinkedHashMap<>() {{
    put("Pancake Mix", "Pour");
    put("Water", "Mix with");
    put("Pan", "Put mixture onto");
}}),
Until I read that this double-brace initialization is dangerous and can cause a memory leak, because each use creates an anonymous subclass, and in non-static contexts that subclass captures a reference to the enclosing instance. Plus, it isn't the best-looking code.
I also found the method:
Map.ofEntries(entry(), entry(), ... entry())
which can be turned into a LinkedHashMap by passing it through the constructor:
PANCAKES(true, new LinkedHashMap<>(Map.ofEntries(entry(), ...)));
Although I haven't found a way to ensure the insertion order is preserved, since as far as I'm aware the maps returned by Map.ofEntries don't preserve insertion order.
Of course, there's always the option to store the LinkedHashMaps in a different place outside of the enum and simply put those in manually, but I feel like this would give me a headache managing, as I intend to add to this enum in the future.
Is there any other way to accomplish this?
To clarify: I don't literally need the code to occupy a single line. I just want the LinkedHashMap initialization and population to be written in the same place, rather than storing these things outside of the enum.
Without more context, I'd say that Recipe is kind of a square peg to try to fit into the round hole of enum. In other words, in the absence of some other requirement or context that suggests an enum is best, I'd probably make it a class and expose public static final instances that can be used like enum values.
For example:
public class Recipe {
    public static final Recipe PANCAKES =
            new Recipe(true,
                    new Step("Pancake Mix", "Pour"),
                    new Step("Water", "Mix with"),
                    new Step("Pan", "Put mixture onto")
            );
    public static final Recipe SANDWICH =
            new Recipe(true
                    // ...steps...
            );
    // ... more Recipes ...

    @Getter
    public static class Step {
        private final String target;
        private final String action;

        private Step(String target, String action) {
            this.target = target;
            this.action = action;
        }
    }

    private final boolean tasty;
    private final LinkedHashMap<String, Step> directions;

    private Recipe(boolean tasty, Step... steps) {
        this.tasty = tasty;
        this.directions = new LinkedHashMap<>();
        for (Step aStep : steps) {
            directions.put(aStep.getTarget(), aStep);
        }
    }
}
You could also do this as an enum, where the values would be declared like this:
PANCAKES(true,
        new Step("Pancake Mix", "Pour"),
        new Step("Water", "Mix with"),
        new Step("Pan", "Put mixture onto")
),
SANDWICH(true
        // ...steps...
);
but like I said, this feels like a proper class as opposed to an enum.
First off, you don't really need to declare the map as a concrete implementation. If you just use Map then you will have a lot more choices.
enum Recipe {
    PANCAKES(true, Map.of()),
    SANDWICH(true, Map.of()),
    STEW(false, Map.of());

    private final boolean tasty;
    private final Map<String, String> directions;
    // getter for directions

    Recipe(boolean tasty, Map<String, String> directions) {
        this.tasty = tasty;
        this.directions = directions;
    }
}
Then, assuming you don't have more than 10 directions, you can use this form:
PANCAKES(true, Map.of(
        "Pancake Mix", "Pour",
        "Water", "Mix with",
        "Pan", "Put mixture onto"))
Map.of creates an immutable map, which is probably what you want for this kind of application, and it does not have the memory-leak issue of double-brace initialization. Note, however, that Map.of does not guarantee any iteration order, so if the order of the directions matters, you still need to build a LinkedHashMap yourself.
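If the order does matter, one way to keep the initialization and the population in the same place is a small varargs helper inside the enum. This is only a sketch of a possible approach, not something from the original answers; it assumes Java 9+ for Map.entry, and the helper name ordered is invented:

import java.util.LinkedHashMap;
import java.util.Map;

enum Recipe {
    PANCAKES(true, ordered(
            Map.entry("Pancake Mix", "Pour"),
            Map.entry("Water", "Mix with"),
            Map.entry("Pan", "Put mixture onto")));

    private final boolean tasty;
    private final Map<String, String> directions;

    Recipe(boolean tasty, Map<String, String> directions) {
        this.tasty = tasty;
        this.directions = directions;
    }

    // Builds a LinkedHashMap, so iteration follows the order the entries are written in.
    @SafeVarargs
    private static Map<String, String> ordered(Map.Entry<String, String>... entries) {
        Map<String, String> map = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : entries) {
            map.put(entry.getKey(), entry.getValue());
        }
        return map;
    }
}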

Which Data Structure would be more suitable for the following task in Java?

Every 5 minutes, within a 20-minute cycle, I need to retrieve the data. Currently I'm using a map data structure.
Is there a better data structure? Every time I read and update the data, I have to write it to a file to prevent data loss if the program restarts.
For example, if the initial data in the map is:
{-1:"result1",-2:"result2",-3:"result3",-4:"result4"}
I want to get the last -4 period's value which is "result4", and set the new value "result5", so that the updated map will be:
{-1:"result5",-2:"result1",-3:"result2",-4:"result3"}
And again, I want to get the last -4 period's value which is "result3", and set the new value "result6", so the map will be:
{-1:"result6",-2:"result5",-3:"result1",-4:"result2"}
The code:
private static String getAndSaveValue(int a) {
    // read the map from file
    HashMap<Long, String> resultMap = getMapFromFile();
    String value = resultMap.get(-4L);
    // shift each entry down one slot: -4 <- -3, -3 <- -2, -2 <- -1
    for (long i = 4L; i >= 2; i--) {
        resultMap.put(-i, resultMap.get(1 - i));
    }
    resultMap.put(-1L, "result" + a);
    // save the map to file
    saveMapToFile(resultMap);
    return value;
}
Based on your requirement, I think the LinkedList data structure will be suitable:
import java.util.LinkedList;

public class Test {
    public static void main(String[] args) {
        LinkedList<String> ls = new LinkedList<String>();
        ls.push("result4");
        ls.push("result3");
        ls.push("result2");
        ls.push("result1");
        System.out.println(ls);
        ls.push("result5");                                // pushing new value
        System.out.println("Last value:" + ls.pollLast()); // this will return `result4`
        System.out.println(ls);
        ls.push("result6");                                // pushing new value
        System.out.println("Last value:" + ls.pollLast()); // this will give you `result3`
        System.out.println(ls);
    }
}
Output:
[result1, result2, result3, result4]
Last value:result4
[result5, result1, result2, result3]
Last value:result3
[result6, result5, result1, result2]
Judging by your example, you need a FIFO data structure which has a bounded size.
There's no bounded general-purpose implementation of the Queue interface in the JDK; only the concurrent implementations can be bounded in size. But if you're not going to use the queue in a multithreaded environment, they are not the best choice, because thread safety doesn't come for free: concurrent collections are slower and can also be confusing for the reader of your code.
To achieve your goal, I suggest you use composition by wrapping an ArrayDeque, which is an array-based implementation of Queue and performs much better than LinkedList.
Note that the preferred approach is not to extend ArrayDeque (an IS-A relationship) and override its methods add() and offer(), but to include it in a class as a field (a HAS-A relationship), so that all method calls on an instance of your class are forwarded to the underlying collection. You can find more information on this approach in the item "Favor composition over inheritance" of Effective Java by Joshua Bloch.
import java.util.ArrayDeque;
import java.util.Queue;

public class BoundQueue<T> {
    private Queue<T> queue;
    private int limit;

    public BoundQueue(int limit) {
        this.queue = new ArrayDeque<>(limit);
        this.limit = limit;
    }

    public void offer(T item) {
        if (queue.size() == limit) {
            queue.poll(); // or throw new IllegalStateException() depending on your needs
        }
        queue.add(item);
    }

    public T poll() {
        return queue.poll();
    }

    public boolean isEmpty() {
        return queue.isEmpty();
    }
}
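Replaying the question's example with this class could look like this (a sketch; poll() reads and removes the oldest value, which corresponds to the -4 slot in the question's map):

BoundQueue<String> queue = new BoundQueue<>(4);
// Oldest first: mirrors {-1:"result1", -2:"result2", -3:"result3", -4:"result4"}
queue.offer("result4");
queue.offer("result3");
queue.offer("result2");
queue.offer("result1");

String last = queue.poll(); // "result4", the -4 period's value
queue.offer("result5");     // the new -1 value; the queue now mirrors {-1:"result5", ..., -4:"result3"}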

Hazelcast not working correctly with SqlPredicate and Index on optional field

We are storing complex objects in Hazelcast maps and need the possibility to search for objects not only based on the key but also on the content of these complex objects. In order to not take too large a performance hit, we are using indices on those search terms.
We are also using spring-data-hazelcast, which provides repositories that allow us to use findByAbcXyz()-style semantic queries. For some of the more complex queries we are using the @Query annotation (which spring-data-hazelcast internally translates to SqlPredicates).
We have now encountered an issue where, under certain circumstances, these @Query-based search methods did not return any values, even though we could verify that the searched objects did in fact exist in the map.
I have managed to reproduce this issue with core hazelcast (i.e. without the use of spring-data-hazelcast).
Here is our object structure:
BetriebspunktKey.java
public class BetriebspunktKey implements Serializable {
    private Integer uicLand;
    private Integer nummer;

    public BetriebspunktKey(final Integer uicLand, final Integer nummer) {
        this.uicLand = uicLand;
        this.nummer = nummer;
    }

    public Integer getUicLand() {
        return uicLand;
    }

    public Integer getNummer() {
        return nummer;
    }
}
Betriebspunkt.java
public class Betriebspunkt implements Serializable {
    private BetriebspunktKey key;
    private List<BetriebspunktVersion> versionen;

    public Betriebspunkt(final BetriebspunktKey key, final List<BetriebspunktVersion> versionen) {
        this.key = key;
        this.versionen = versionen;
    }

    public BetriebspunktKey getKey() {
        return key;
    }

    public List<BetriebspunktVersion> getVersionen() {
        return versionen;
    }
}
BetriebspunktVersion.java
public class BetriebspunktVersion implements Serializable {
    private List<BetriebspunktKey> zusatzbetriebspunkte;

    public BetriebspunktVersion(final List<BetriebspunktKey> zusatzbetriebspunkte) {
        this.zusatzbetriebspunkte = zusatzbetriebspunkte;
    }

    public List<BetriebspunktKey> getZusatzbetriebspunkte() {
        return zusatzbetriebspunkte;
    }
}
In my main file, I am now setting up hazelcast:
Config config = new Config();
final MapConfig mapConfig = config.getMapConfig("points");
mapConfig.addMapIndexConfig(new MapIndexConfig("versionen[any].zusatzbetriebspunkte[any].nummer", false));
HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
IMap<BetriebspunktKey, Betriebspunkt> map = instance.getMap("points");
I am also preparing my search criteria for later on:
Predicate equalPredicate = Predicates.equal("versionen[any].zusatzbetriebspunkte[any].nummer", 53090);
Predicate sqlPredicate = new SqlPredicate("versionen[any].zusatzbetriebspunkte[any].nummer=53090");
Next, I am creating two objects, one with the "full depth" of information, the other does not contain any "zusatzbetriebspunkte":
final Betriebspunkt abc = new Betriebspunkt(
        new BetriebspunktKey(80, 166),
        Collections.singletonList(new BetriebspunktVersion(
                Collections.singletonList(new BetriebspunktKey(80, 53090))
        ))
);
final Betriebspunkt def = new Betriebspunkt(
        new BetriebspunktKey(83, 141),
        Collections.singletonList(new BetriebspunktVersion(
                Collections.emptyList()
        ))
);
Here is where things get interesting. If I first insert the "full" object into the map, the search works with both the EqualPredicate and the SqlPredicate:
map.put(abc.getKey(), abc);
map.put(def.getKey(), def);
Collection<Betriebspunkt> equalResults = map.values(equalPredicate);
Collection<Betriebspunkt> sqlResults = map.values(sqlPredicate);
assertEquals(1, equalResults.size()); // contains "abc"
assertEquals(1, sqlResults.size()); // contains "abc"
However, if I insert the objects into my map in reverse order (i.e. first the "partial" object and then the "full" one), only the EqualPredicate works correctly; the SqlPredicate returns an empty list, no matter what the content of the map or the search criteria:
map.put(def.getKey(), def);
map.put(abc.getKey(), abc);
Collection<Betriebspunkt> equalResults = map.values(equalPredicate);
Collection<Betriebspunkt> sqlResults = map.values(sqlPredicate);
assertEquals(1, equalResults.size()); // contains "abc"
assertEquals(1, sqlResults.size()); // --> this fails, it returns an empty list
What is the reason for this behaviour? It looks like a bug in the hazelcast code.
The reason for the failure
After a lot of debugging, I found the reason for this issue; it does indeed lie in the Hazelcast code.
When putting a value into a Hazelcast map, DefaultRecordStore.putInternal is called. At the end of this method, DefaultRecordStore.saveIndex is called, which finds the corresponding indexes and then calls Indexes.saveEntryIndex. This method iterates over each index and calls InternalIndex.saveEntryIndex (or rather its implementation, IndexImpl.saveEntryIndex). The interesting part of that method is the following lines:
if (this.converter == null || this.converter == TypeConverters.NULL_CONVERTER) {
    this.converter = entry.getConverter(this.attributeName);
}
Apparently each index stores a converter when the first element is put into the map. Looking at QueryableEntry.getConverter explains what happens:
TypeConverter getConverter(String attributeName) {
    Object attribute = this.getAttributeValue(attributeName);
    if (attribute == null) {
        return TypeConverters.NULL_CONVERTER;
    } else {
        AttributeType attributeType = this.extractAttributeType(attributeName, attribute);
        return attributeType == null ? TypeConverters.IDENTITY_CONVERTER : attributeType.getConverter();
    }
}
When first inserting the "full" object, extractAttributeType() will follow the "path" of our index definition "versionen[any].zusatzbetriebspunkte[any].nummer" and find out that nummer is an integer type; accordingly, a TypeConverters.IntegerConverter will be returned and stored.
When first inserting the "partial" object, zusatzbetriebspunkte[any] is empty, and there is no way for extractAttributeType to find out what type nummer has. It therefore returns null, which means that the TypeConverters.IdentityConverter is used.
Also, whenever a "full" element is inserted, an entry is written into the index map using nummer as the key, i.e. the index map is keyed by Integer values.
So much for writing to the map. Let's now look at how data is read from it. When calling map.values(predicate), we eventually get to QueryRunner.runUsingGlobalIndexSafely, which contains this line:
Collection<QueryableEntry> entries = indexes.query(predicate);
This will in turn, after some boilerplate code, call
Set<QueryableEntry> result = indexAwarePredicate.filter(queryContext);
For both of our predicates we will eventually get to IndexImpl.getRecords() which looks as follows:
public Set<QueryableEntry> getRecords(Comparable attributeValue) {
    long timestamp = this.stats.makeTimestamp();
    if (this.converter == null) {
        this.stats.onIndexHit(timestamp, 0L);
        return new SingleResultSet((Map) null);
    } else {
        Set<QueryableEntry> result = this.indexStore.getRecords(this.convert(attributeValue));
        this.stats.onIndexHit(timestamp, (long) result.size());
        return result;
    }
}
The crucial call is this.convert(attributeValue) where attributeValue is the value of the predicate.
If we compare our two predicates, we can see that the EqualPredicate has two members:
attributeName = "versionen[any].zusatzbetriebspunkte[any].nummer"
value = {Integer} 53090
The SqlPredicate contains the initial string (which we passed to its constructor), but at construction time this string was also parsed and mapped to an internal EqualPredicate (which is eventually used when the predicate is evaluated and passed to getRecords() above):
sql = "versionen[any].zusatzbetriebspunkte[any].nummer=53090"
predicate = {EqualPredicate}
attributeName = "versionen[any].zusatzbetriebspunkte[any].nummer"
value = {String} "53090"
And this explains why the manually created EqualPredicate works in both cases: its value is an Integer. When passed to the converter, it does not matter whether that is the IntegerConverter or the IdentityConverter, as both will return the integer, which can then be used as a key in the index map (which is keyed by Integer).
With the SqlPredicate, however, the value is a String. If it is passed to the IntegerConverter, it is converted to its corresponding integer value and the index-map lookup works. If it is passed to the IdentityConverter, the string is returned unchanged, and looking up the Integer-keyed index map with a string will never find any results.
A possible solution
How can we solve this issue? I see several possibilities:
insert a "fully built" dummy value into our map during startup to ensure the converter is correctly initialised. While this works, it is ugly and not maintenance-friendly.
avoid SqlPredicate and use the Integer-based EqualPredicate. This is not an option when working with spring-data-hazelcast, as it always converts @Query-based searches to SqlPredicates. We could of course use Hazelcast directly and circumvent the spring-data wrapper, but while that would work, it would mean having two ways of accessing Hazelcast, which is also not very maintainable.
use Hazelcast's ValueExtractor mechanism. This is the elegant solution that works both natively and with spring-data-hazelcast. I will outline what that looks like:
First we need to implement a value extractor which returns all zusatzbetriebspunkte of our Betriebspunkt in a form suitable for us:
public class BetriebspunktExtractor extends ValueExtractor<Betriebspunkt, String> implements Serializable {
    @Override
    public void extract(final Betriebspunkt betriebspunkt, final String argument, final ValueCollector valueCollector) {
        betriebspunkt.getVersionen().stream()
                .map(BetriebspunktVersion::getZusatzbetriebspunkte)
                .flatMap(List::stream)
                .map(zbp -> zbp.getUicLand() + "_" + zbp.getNummer())
                .forEach(valueCollector::addObject);
    }
}
You'll notice that I am not only returning the nummer field but also including the uicLand field; this is something we really wanted but couldn't get working using the "...[any]..." notation. We could of course return only the nummer if we wanted exactly the same behaviour as outlined above.
Now we need to modify our Hazelcast configuration slightly:
Config config = new Config();
final MapConfig mapConfig = config.getMapConfig("points");
//mapConfig.addMapIndexConfig(new MapIndexConfig("versionen[any].zusatzbetriebspunkte[any].nummer", false));
mapConfig.addMapIndexConfig(new MapIndexConfig("zusatzbetriebspunkt", false));
mapConfig.addMapAttributeConfig(new MapAttributeConfig("zusatzbetriebspunkt", BetriebspunktExtractor.class.getName()));
You'll notice that the "long" index definition using the "...[any]..." notation is no longer needed.
Now we can use this "pseudo attribute" to query our values, and it no longer matters in which order the objects were added to the map:
Predicate keyPredicate = Predicates.equal("zusatzbetriebspunkt", "80_53090");
Collection<Betriebspunkt> keyResults = map.values(keyPredicate);
assertEquals(1, keyResults.size()); // always contains "abc"
And in our spring-data-hazelcast repository we can now do this:
@Query("zusatzbetriebspunkt=%d_%d")
List<StammdatenBetriebspunkt> findByZusatzbetriebspunkt(Integer uicLand, Integer nummer);
If you do not need spring-data-hazelcast, then instead of passing a string to the ValueCollector you could pass the BetriebspunktKey directly and then use it in the predicate as well. That would be the cleanest solution:
public class BetriebspunktExtractor extends ValueExtractor<Betriebspunkt, String> implements Serializable {
    @Override
    public void extract(final Betriebspunkt betriebspunkt, final String argument, final ValueCollector valueCollector) {
        betriebspunkt.getVersionen().stream()
                .map(BetriebspunktVersion::getZusatzbetriebspunkte)
                .flatMap(List::stream)
                //.map(zbp -> zbp.getUicLand() + "_" + zbp.getNummer())
                .forEach(valueCollector::addObject);
    }
}
and then
Predicate keyPredicate = Predicates.equal("zusatzbetriebspunkt", new BetriebspunktKey(80, 53090));
However, for this to work, BetriebspunktKey needs to implement Comparable and must also provide its own equals and hashCode methods.
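A minimal sketch of what that could look like, as my own illustration rather than code from the original answer (the getters shown earlier are omitted here for brevity):

import java.io.Serializable;
import java.util.Objects;

public class BetriebspunktKey implements Serializable, Comparable<BetriebspunktKey> {
    private Integer uicLand;
    private Integer nummer;

    public BetriebspunktKey(final Integer uicLand, final Integer nummer) {
        this.uicLand = uicLand;
        this.nummer = nummer;
    }

    @Override
    public boolean equals(final Object o) {
        if (this == o) return true;
        if (!(o instanceof BetriebspunktKey)) return false;
        final BetriebspunktKey other = (BetriebspunktKey) o;
        return Objects.equals(uicLand, other.uicLand) && Objects.equals(nummer, other.nummer);
    }

    @Override
    public int hashCode() {
        return Objects.hash(uicLand, nummer);
    }

    @Override
    public int compareTo(final BetriebspunktKey other) {
        // Order by uicLand first, then by nummer.
        final int cmp = uicLand.compareTo(other.uicLand);
        return cmp != 0 ? cmp : nummer.compareTo(other.nummer);
    }
}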

Large number of Object in Java (with HashMap)

Hello,
I'm currently working on word prediction in Java.
For this I'm using an NGram-based model, but I have some memory issues...
At first, I had a model like this:
public class NGram implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient int count;
    private int id;
    private NGram next;

    public NGram(int idP) {
        this.id = idP;
    }
}
But it takes a lot of memory, so I thought I needed to optimize. For instance, if I have "hello the world" and "hello the people", instead of keeping two separate ngrams I could keep one that stores "hello the" and then has two possibilities: "people" and "world".
To be more clear, this is my new model :
public class BNGram implements Serializable {
    private static final long serialVersionUID = 1L;
    private int id;
    private HashMap<Integer, BNGram> next;
    private int count = 1;

    public BNGram(int idP) {
        this.id = idP;
        this.next = new HashMap<Integer, BNGram>();
    }
}
But it seems that my second model consumes twice as much memory... I think it's because of the HashMap, but I don't know how to reduce this. I tried different Map implementations like Trove and others, but it didn't change anything.
To give you an idea: for a text of 9MB with 57,818 distinct words (distinct, not the total number of words), after NGram generation my javaw process consumes 1.2GB of memory...
If I save it with GZIPOutputStream, it takes around 18MB on disk.
So my question is: how can I use less memory? Can I do something with compression (as with the serialization)?
I need to add this to another application, so I need to reduce the memory usage first...
Thanks a lot, and sorry for my bad English...
ZiMath
You need a specialized structure to achieve what you want.
Take a look at Apache's PatriciaTrie. It's like a Map, but it's memory-efficient and works with Strings. It's also extremely fast: operations are O(k), with k being the number of bits of the largest key.
It has an operation that suits your immediate needs: prefixMap(), which returns a SortedMap view of the trie containing the Strings that start with the given prefix.
A brief usage example:
import java.util.SortedMap;
import org.apache.commons.collections4.trie.PatriciaTrie;

public class Patricia {
    public static void main(String[] args) {
        PatriciaTrie<String> trie = new PatriciaTrie<>();
        String world = "hello the world";
        String people = "hello the people";
        trie.put(world, null);
        trie.put(people, null);

        SortedMap<String, String> map1 = trie.prefixMap("hello");
        System.out.println(map1.keySet()); // [hello the people, hello the world]

        SortedMap<String, String> map2 = trie.prefixMap("hello the w");
        System.out.println(map2.keySet()); // [hello the world]

        SortedMap<String, String> map3 = trie.prefixMap("hello the p");
        System.out.println(map3.keySet()); // [hello the people]
    }
}
There are also the tests, which contain more examples.
Here, I'm primarily trying to explain why you are observing such excessive memory consumption, and what you could do about it (if you want to stick with the HashMap):
A HashMap created with the default constructor has an initial capacity of 16. This means it has space for 16 entries even while it is empty. Additionally, you create the map regardless of whether it is needed or not.
So the way to reduce the memory consumption in your case would be to:
Create the map only when it is necessary
Create it with a smaller initial capacity
Applied to your class, this could roughly look like this:
public class BNGram {
    private int id;
    private Map<Integer, BNGram> next;

    public BNGram(int idP) {
        this.id = idP;
        // (Do not create a new `Map` here!)
    }

    void doSomethingWhereTheMapIsNeeded(Integer key, BNGram value) {
        // Create the map only when required, with an initial capacity of 1
        if (next == null) {
            next = new HashMap<Integer, BNGram>(1);
        }
        next.put(key, value);
    }
}
But...
... conceptually, it is questionable to have a large "tree" structure consisting of many, many maps, each with only a "few" entries. This suggests that a different data structure is more appropriate here. So you should definitely prefer a solution like the one in the answer by Magnamag, or (if that is not applicable for you, as suggested in your comments) look for an alternative data structure, maybe even by formulating this as a new question that does not suffer from the XY problem.

Many readers but not when writer available with HashMap java

I've crawled through many questions in this area, but my question still remains. I'm also seeking a somewhat elaborate answer (if you'd be kind enough?), so that I, and the community as well, can understand this more clearly.
This is my question. I have this map:
private static volatile Map<Integer, Type> types;
and a static getter:
static Type getType(final int id)
{
    if (types == null)
    {
        synchronized (CLASSNAME.class)
        {
            if (types == null)
            {
                types = new HashMap<Integer, Type>();
                // ... add items to the map
            }
        }
    }
    return types.get(id);
}
The problem with this code: the first thread can initialize types, so it is no longer null, and while that thread is still adding values to the map, a second thread can already retrieve data from it. That means readers can see a partially populated map.
I see that this can be avoided by synchronizing the whole method, but then multiple concurrent readers are not possible. The map is constructed once and never modified afterwards, so multiple readers are essential.
We could also use Collections.synchronizedMap, but if I'm correct it doesn't allow concurrent readers either. I tried ConcurrentHashMap, but it doesn't solve this; maybe that's due to its independent partition-locking behavior.
Simply put: what I need is no reading until the map is fully created, and then multiple concurrent reads should be possible.
Anyone got a solution?
Thanks.
There is a simple solution to your problem: use a temporary variable, so that the reference types stays null until the map is completely populated. Because types is volatile, the write types = temp becomes visible to other threads only after the map has been fully populated, so a reader sees either null or the complete map. If you change the code in this way, it is thread-safe and quite efficient.
static Type getType(final int id) {
    if (types == null) {
        synchronized (CLASSNAME.class) {
            if (types == null) {
                HashMap<Integer, Type> temp = new HashMap<>();
                // populate temp
                types = temp;
            }
        }
    }
    return types.get(id);
}
Thread-safe, lazy, and efficient initialization is a frequently required feature. Unfortunately, it is not directly supported by Java, neither by the programming language nor by the standard library. Instead, there are different patterns; your implementation is known as double-checked locking.
A short excursion to C++: C++11 has support for lazy, thread-safe initialization both in the language and in the library. If there is only one global type mapping, you can write the following in C++:
auto populated_map()
{
    std::map<int, type> result;
    // ... populate map
    return result;
}

auto get_type(int id) -> const type&
{
    static const std::map<int, type> map = populated_map();
    return map.find(id)->second;
}
If you need lazy initialization per object, you can use the library support around std::once_flag and std::call_once:
class types
{
private:
    std::once_flag _flag;
    std::map<int, type> _map;

public:
    auto get_type(int id) -> const type&
    {
        std::call_once(_flag, [this] { _map = populated_map(); });
        return _map.find(id)->second;
    }
};
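Back in Java, the closest analogue to the C++ function-local static is the initialization-on-demand holder idiom, which leans on the JVM's lazy, thread-safe class initialization instead of explicit locking. A minimal sketch, with class and method names adapted to the question (Type refers to the question's value class):

import java.util.HashMap;
import java.util.Map;

final class Types {
    // The nested class is not initialized until Holder.TYPES is first used,
    // and the JVM guarantees that class initialization is thread-safe.
    private static final class Holder {
        static final Map<Integer, Type> TYPES = populate();
    }

    private static Map<Integer, Type> populate() {
        Map<Integer, Type> map = new HashMap<>();
        // ... add items to the map
        return map;
    }

    static Type getType(final int id) {
        return Holder.TYPES.get(id);
    }
}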
Take a look at the memoization pattern. There are specific implementations available in Java 8, but if you aren't adopting that soon, look at Guava's MapMaker, specifically:
private final ConcurrentMap<Integer, Type> types = new MapMaker()
        .makeComputingMap(new Function<Integer, Type>() {
            public Type apply(Integer key) {
                return loadForType(key);
            }
        });
In this case, no single thread has to populate the whole map up front (though a given entry is computed by a single thread). The idea is: when a thread enters, it checks whether a value for the given Integer is available. If not, it runs the function once to compute it; if it is, the value is returned without blocking.
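For reference, the Java 8 facility this answer alludes to is ConcurrentHashMap.computeIfAbsent, which memoizes per key without any external library. A minimal sketch; TypeCache and loadForType are illustrative names of my own, not from the original answer:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class TypeCache {
    // Stand-in for the question's Type class.
    static class Type { }

    private final ConcurrentMap<Integer, Type> types = new ConcurrentHashMap<>();

    Type getType(final int id) {
        // computeIfAbsent runs the loader at most once per key (atomically);
        // lookups of already-present keys do not block.
        return types.computeIfAbsent(id, this::loadForType);
    }

    private Type loadForType(final Integer id) {
        // Assumed loader; replace with the real lookup or creation logic.
        return new Type();
    }
}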
