Hello,
I'm currently working on word prediction in Java. For this, I'm using an n-gram based model, but I have some memory issues...
At first, I had a model like this:
public class NGram implements Serializable {
private static final long serialVersionUID = 1L;
private transient int count;
private int id;
private NGram next;
public NGram(int idP) {
this.id = idP;
}
}
But it takes a lot of memory, so I thought I needed an optimization: if I have "hello the world" and "hello the people", instead of keeping two separate n-grams, I could keep a single one for "hello the" which then has two possibilities, "world" and "people".
To be clearer, this is my new model:
public class BNGram implements Serializable {
private static final long serialVersionUID = 1L;
private int id;
private HashMap<Integer,BNGram> next;
private int count = 1;
public BNGram(int idP) {
this.id = idP;
this.next = new HashMap<Integer, BNGram>();
}
}
But it seems that my second model consumes twice as much memory... I think it's because of the HashMap, but I don't know how to reduce it. I tried different Map implementations such as Trove and others, but it didn't change anything.
To give you an idea: for a 9MB text with 57818 distinct words (distinct, not the total number of words), after n-gram generation my javaw process consumes 1.2GB of memory...
If I save the model with GZIPOutputStream, it takes around 18MB on disk.
So my question is: how can I use less memory? Can I do something with compression (as with the serialization)?
I need to embed this in another application, so I need to reduce the memory usage first...
Thanks a lot, and sorry for my bad English...
ZiMath
You need a specialized structure to achieve what you want.
Take a look at Apache's PatriciaTrie. It's like a Map, but it's memory-efficient and works with Strings. It's also extremely fast: operations are O(k), where k is the number of bits of the largest key.
It has an operation that suits your immediate needs: prefixMap(), which returns a SortedMap view of the trie that contains Strings which are prefixed by the given key.
A brief usage example:
import java.util.SortedMap;
import org.apache.commons.collections4.trie.PatriciaTrie;
public class Patricia {
public static void main(String[] args) {
PatriciaTrie<String> trie = new PatriciaTrie<>();
String world = "hello the world";
String people = "hello the people";
trie.put(world, world);
trie.put(people, people);
SortedMap<String, String> map1 = trie.prefixMap("hello");
System.out.println(map1.keySet()); // [hello the people, hello the world]
SortedMap<String, String> map2 = trie.prefixMap("hello the w");
System.out.println(map2.keySet()); // [hello the world]
SortedMap<String, String> map3 = trie.prefixMap("hello the p");
System.out.println(map3.keySet()); // [hello the people]
}
}
There are also the tests, which contain more examples.
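Applied to the question's word-prediction case, you could store n-gram counts as trie values and look up candidate continuations with prefixMap(). A small sketch under that assumption (the counting scheme is mine, not from the question):
import java.util.SortedMap;
import org.apache.commons.collections4.trie.PatriciaTrie;

public class NextWordSketch {
    public static void main(String[] args) {
        PatriciaTrie<Integer> counts = new PatriciaTrie<>();
        counts.put("hello the world", 3);
        counts.put("hello the people", 5);
        // all stored n-grams that continue "hello the", with their counts
        SortedMap<String, Integer> candidates = counts.prefixMap("hello the ");
        candidates.forEach((ngram, count) -> System.out.println(ngram + " -> " + count));
    }
}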
Here, I'm primarily trying to explain why you are observing such excessive memory consumption, and what you could do about it (if you want to stick with the HashMap):
A HashMap created with the default constructor has an initial capacity of 16, meaning it reserves space for 16 entries even while it holds only one or two (on newer JVMs the backing table is allocated lazily on the first put, but every map instance still carries a fixed object overhead). Additionally, you create the map regardless of whether it is needed at all.
So one way to reduce the memory consumption in your case would be to:
Create the map only when it is necessary
Create it with a smaller initial capacity
Applied to your class, this could roughly look like this:
public class BNGram {
private int id;
private Map<Integer,BNGram> next;
public BNGram(int idP) {
this.id = idP;
// (Do not create a new `Map` here!)
}
void doSomethingWhereTheMapIsNeeded(Integer key, BNGram value) {
// Create a map, when required, with an initial capacity of 1
if (next == null) {
next = new HashMap<Integer, BNGram>(1);
}
next.put(key, value);
}
}
But...
... conceptually, it is questionable to have a large "tree" structure consisting of many, many maps, each only with "few" entries. This suggests that a different data structure is more appropriate here. So you should definitely prefer a solution like the one in the answer by Magnamag, or (if this is not applicable for you, as suggested in your comments), look out for an alternative data structure - maybe even by formulating this as a new question that does not suffer from the XY Problem.
Related
I'm somewhat of a beginner to Java, although I understand the basics. I believe this is the best implementation for my problem, but obviously I may be wrong. This is a mock example I made, and I'm not interested in looking for different implementations; I only mention this in case what I'm asking turns out to be impossible. Regardless:
Here I have an enum, inside of which I want a map (specifically a LinkedHashMap) as one of the enum constants' stored values:
enum Recipe {
PANCAKES(true, new LinkedHashMap<>() ),
SANDWICH(true, new LinkedHashMap<>() ),
STEW(false, new LinkedHashMap<>() );
private final boolean tasty;
private final LinkedHashMap<String, String> directions;
// getter for directions
Recipe(boolean tasty, LinkedHashMap<String, String> directions) {
this.tasty = tasty;
this.directions = directions;
}
}
However, I haven't found a way to initialize and populate a Map of any size in a single statement (as would be needed for an enum).
For example, I thought this looked fine:
PANCAKES(true, new LinkedHashMap<>(){{
put("Pancake Mix","Pour");
put("Water","Mix with");
put("Pan","Put mixture onto");
}})
Until I read that this is dangerous and can cause a memory leak (double-brace initialization creates an anonymous subclass, which in non-static contexts also holds a reference to the enclosing instance). Plus, it isn't the best-looking code.
I also found the method:
Map.ofEntries(entry(), entry(), ... entry())
which can be turned into a LinkedHashMap by passing the result to its constructor:
PANCAKES(true, new LinkedHashMap<>(Map.ofEntries(entry(), ...)) );
Although I haven't found a way to ensure the insertion order is preserved, since as far as I'm aware these Maps don't preserve insertion order.
Of course, there's always the option to store the LinkedHashMaps somewhere outside of the enum and simply put the entries in manually, but I feel like that would be a headache to manage, as I intend to add to this enum in the future.
Is there any other way to accomplish this?
To clarify: I don't literally need the code to occupy a single line; I just want the LinkedHashMap initialization and population to be written in the same place, rather than storing these things outside of the enum.
Without more context, I'd say that Recipe is kind of a square peg to try to fit into the round hole of enum. In other words, in the absence of some other requirement or context that suggests an enum is best, I'd probably make it a class and expose public static final instances that can be used like enum values.
For example:
public class Recipe {
public static final Recipe PANCAKES =
new Recipe(true,
new Step("Pancake Mix","Pour"),
new Step("Water","Mix with"),
new Step("Pan","Put mixture onto")
);
public static final Recipe SANDWICH =
new Recipe(true
// ...steps...
);
// ...more Recipes ...
@Getter // Lombok
public static class Step {
private final String target;
private final String action;
private Step(String target, String action ) {
this.target = target;
this.action = action;
}
}
private final boolean tasty;
private final LinkedHashMap<String, Step> directions;
private Recipe(boolean tasty, Step... steps) {
this.tasty = tasty;
this.directions = new LinkedHashMap<>();
for (Step aStep : steps) {
directions.put(aStep.getTarget(), aStep);
}
}
}
You could also do this as an enum, where the values would be declared like this:
PANCAKES(true,
new Step("Pancake Mix","Pour"),
new Step("Water","Mix with"),
new Step("Pan","Put mixture onto")
),
SANDWICH(true
// ...steps...
);
but like I said, this feels like a proper class as opposed to an enum.
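A quick usage sketch (assuming a getDirections() accessor is added to Recipe; iteration follows insertion order thanks to the LinkedHashMap):
for (Recipe.Step step : Recipe.PANCAKES.getDirections().values()) {
    System.out.println(step.getAction() + " " + step.getTarget());
}
// Pour Pancake Mix
// Mix with Water
// Put mixture onto Pan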
First off, you don't really need to declare the map as a concrete implementation. If you just use Map, then you will have a lot more choices.
enum Recipe {
PANCAKES(true, Map.of()),
SANDWICH(true, Map.of()),
STEW(false, Map.of());
private final boolean tasty;
private final Map<String, String> directions;
// getter for directions
Recipe(boolean tasty, Map<String, String> directions) {
this.tasty = tasty;
this.directions = directions;
}
}
Then, assuming you don't have more than 10 directions, you can use this form:
PANCAKES(true, Map.of(
"Pancake Mix","Pour",
"Water","Mix with",
"Pan","Put mixture onto"))
Map.of creates an immutable map, which is probably what you want for this kind of application, and it should not have memory-leak issues.
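One caveat worth adding: Map.of makes no guarantee about iteration order, so if the directions must come back in insertion order, a LinkedHashMap is still needed behind the Map type. A small static helper keeps the declaration inline; orderedMapOf is a hypothetical name, not a JDK method:
@SafeVarargs
static <K, V> Map<K, V> orderedMapOf(Map.Entry<K, V>... entries) {
    Map<K, V> map = new LinkedHashMap<>();
    for (Map.Entry<K, V> e : entries) {
        map.put(e.getKey(), e.getValue()); // preserves argument order
    }
    return map;
}
It would be used as PANCAKES(true, orderedMapOf(Map.entry("Pancake Mix", "Pour"), Map.entry("Water", "Mix with"), Map.entry("Pan", "Put mixture onto"))).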
Every 5 minutes, within a 20-minute cycle, I need to retrieve the data. Currently, I'm using the map data structure.
Is there a better data structure? Every time I read and set the data, I have to write it to a file, to prevent data loss in case the program restarts.
For example, if the initial data in the map is:
{-1:"result1",-2:"result2",-3:"result3",-4:"result4"}
I want to get the last -4 period's value which is "result4", and set the new value "result5", so that the updated map will be:
{-1:"result5",-2:"result1",-3:"result2",-4:"result3"}
And again, I want to get the last -4 period's value which is "result3", and set the new value "result6", so the map will be:
{-1:"result6",-2:"result5",-3:"result1",-4:"result2"}
The code:
private static String getAndSaveValue(int a) {
//read the map from file
HashMap<Long,String> resultMap=getMapFromFile();
String value=resultMap.get(-4L);
for (long i = 4; i >= 2; i--) {
// shift each value one slot older: -4 <- -3, -3 <- -2, -2 <- -1
resultMap.put(-i, resultMap.get(1 - i));
}
resultMap.put(-1L,"result" + a);
//save the map to file
saveMapToFile(resultMap);
return value;
}
Based on your requirement, I think the LinkedList data structure will be suitable:
import java.util.LinkedList;
public class Test {
public static void main(String[] args) {
LinkedList<String> ls=new LinkedList<String>();
ls.push("result4");
ls.push("result3");
ls.push("result2");
ls.push("result1");
System.out.println(ls);
ls.push("result5"); //pushing new value
System.out.println("Last value:"+ls.pollLast()); //this will return `result4`
System.out.println(ls);
ls.push("result6"); //pushing new value
System.out.println("Last value:"+ls.pollLast()); // this will give you `result3`
System.out.println(ls);
}
}
Output:
[result1, result2, result3, result4]
Last value:result4
[result5, result1, result2, result3]
Last value:result3
[result6, result5, result1, result2]
Judging by your example, you need a FIFO data structure which has a bounded size.
There's no bounded general-purpose implementation of the Queue interface in the JDK; only the concurrent implementations can be bounded in size. But if you're not going to use the queue in a multithreaded environment, those are not the best choice, because thread safety doesn't come for free: concurrent collections are slower, and they can also be confusing for the reader of your code.
To achieve your goal, I suggest you use composition by wrapping an ArrayDeque, which is an array-based implementation of Queue that performs way better than LinkedList.
Note that the preferred approach is not to extend ArrayDeque (an IS-A relationship) and override its methods add() and offer(), but to include it in your class as a field (a HAS-A relationship), so that all method calls on an instance of your class are forwarded to the underlying collection. You can find more information on this approach in the item "Favor composition over inheritance" of Effective Java by Joshua Bloch.
import java.util.ArrayDeque;
import java.util.Queue;
public class BoundQueue<T> {
private Queue<T> queue;
private int limit;
public BoundQueue(int limit) {
this.queue = new ArrayDeque<>(limit);
this.limit = limit;
}
public void offer(T item) {
if (queue.size() == limit) {
queue.poll(); // or throw new IllegalStateException() depending on your needs
}
queue.add(item);
}
public T poll() {
return queue.poll();
}
public boolean isEmpty() {
return queue.isEmpty();
}
}
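A usage sketch mirroring the question's four-slot rotation (values taken from the question):
BoundQueue<String> results = new BoundQueue<>(4);
results.offer("result4"); // oldest
results.offer("result3");
results.offer("result2");
results.offer("result1"); // queue is now full
results.offer("result5"); // evicts "result4", the oldest entry
If you also need the evicted value back (the question's getAndSaveValue returns it), call poll() yourself before offering the new item.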
I am studying data-skew processing in Flink and how I can use low-level control of physical partitioning to achieve even processing of tuples. I have created synthetic skewed data sources and I aim to process (aggregate) them over a window. Here is the complete code.
streamTrainsStation01.union(streamTrainsStation02)
.union(streamTicketsStation01).union(streamTicketsStation02)
// map the keys
.map(new StationPlatformMapper(metricMapper)).name(metricMapper)
.rebalance() // or .rescale() .shuffle()
.keyBy(new StationPlatformKeySelector())
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
.apply(new StationPlatformRichWindowFunction(metricWindowFunction)).name(metricWindowFunction)
.setParallelism(4)
.map(new StationPlatformMapper(metricSkewedMapper)).name(metricSkewedMapper)
.addSink(new MqttStationPlatformPublisher(ipAddressSink, topic)).name(metricSinkFunction)
;
According to the Flink dashboard, I could not see much difference among .shuffle(), .rescale(), and .rebalance(), even though the documentation says the rebalance() transformation is the one most suitable for data skew.
After that I tried to use .partitionCustom(partitioner, "someKey"). However, to my surprise, I could not use setParallelism(4) on the window operation. The documentation says:
Note: This operation is inherently non-parallel since all elements
have to pass through the same operator instance.
I did not understand why. If I am allowed to do partitionCustom, why can't I use parallelism after that? Here is the complete code.
streamTrainsStation01.union(streamTrainsStation02)
.union(streamTicketsStation01).union(streamTicketsStation02)
// map the keys
.map(new StationPlatformMapper(metricMapper)).name(metricMapper)
.partitionCustom(new StationPlatformKeyCustomPartitioner(), new StationPlatformKeySelector())
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(20)))
.apply(new StationPlatformRichAllWindowFunction(metricWindowFunction)).name(metricWindowFunction)
.map(new StationPlatformMapper(metricSkewedMapper)).name(metricSkewedMapper)
.addSink(new MqttStationPlatformPublisher(ipAddressSink, topic)).name(metricSinkFunction)
;
Thanks,
Felipe
I got an answer from the Flink user mailing list. Basically, using keyBy() after rebalance() cancels out everything rebalance() is trying to do. The first (ad hoc) solution that I found is to create a composite key that accounts for the skewed key.
public class CompositeSkewedKeyStationPlatform implements Serializable {
private static final long serialVersionUID = -5960601544505897824L;
private Integer stationId;
private Integer platformId;
private Integer skewParameter;
public CompositeSkewedKeyStationPlatform(Integer stationId, Integer platformId, Integer skewParameter) {
this.stationId = stationId;
this.platformId = platformId;
this.skewParameter = skewParameter;
}
}
I use it in a map function before using keyBy().
public class StationPlatformSkewedKeyMapper
extends RichMapFunction<MqttSensor, Tuple2<CompositeSkewedKeyStationPlatform, MqttSensor>> {
private SkewParameterGenerator skewParameterGenerator;
public StationPlatformSkewedKeyMapper() {
this.skewParameterGenerator = new SkewParameterGenerator(10);
}
@Override
public Tuple2<CompositeSkewedKeyStationPlatform, MqttSensor> map(MqttSensor value) throws Exception {
Integer platformId = value.getKey().f2;
Integer stationId = value.getKey().f4;
Integer skewParameter = 0;
if (stationId.equals(2) && platformId.equals(3)) {
skewParameter = this.skewParameterGenerator.getNextItem();
}
CompositeSkewedKeyStationPlatform compositeKey = new CompositeSkewedKeyStationPlatform(stationId, platformId,
skewParameter);
return Tuple2.of(compositeKey, value);
}
}
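Downstream, the composite key is what keyBy() operates on, so the one hot (stationId=2, platformId=3) key is spread over up to 10 distinct keys. A sketch of the wiring, with names taken from the question's first pipeline (note that CompositeSkewedKeyStationPlatform must implement equals() and hashCode() over all three fields for keyBy() to work):
unionedStream // the four unioned source streams from the question
    .map(new StationPlatformSkewedKeyMapper())
    .keyBy(tuple -> tuple.f0) // key by the composite (station, platform, skew) key
    .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
    .apply(new StationPlatformRichWindowFunction(metricWindowFunction))
    .setParallelism(4);
Since each skewed key is now split across several parallel windows, the window results are partial and have to be combined per (station, platform) in a second aggregation step.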
Here is my complete solution.
I'm learning about Sets and Maps in the Introduction to Java Programming book by Daniel Liang. My professor has assigned a problem in the back of the chapter that asks me to create a program that:
Queries the user for input on name
Queries the user for gender
Using these two criteria, and this/these website(s): http://cs.armstrong.edu/liang/data/babynamesranking2001.txt
... http://cs.armstrong.edu/liang/data/babynamesranking2010.txt
I have to be able to get the ranking.
I'm supposed to get this information into an array of 10 maps, where each map corresponds to one .txt file/year. This is where I'm having problems. How do I do this?
The rank (int) of the baby name is the value of the map, and the key is the name (String) of the baby.
The way I was thinking was to create an array of maps, or maybe a list of them. So, like:
List<Map<String, Integer>> or Map<String, Integer>[] myArray;
Yet even after that, the issue of how to get all of this information from the .txt files into my maps is a hard one for me.
This is what I've come up with so far. I can't say I'm happy with it. It doesn't even work when I try to start reading information, because I haven't specified the size of my array.
public class BabyNamesAndPopularity
{
public static void main (String[] args) throws IOException
{
Map<String, Integer>[] arrayOfMaps;
String myURL = "http://cs.armstrong.edu/liang/data/babynamesranking2001.txt";
java.net.URL url = new java.net.URL(myURL);
Scanner urlInput = new Scanner (url.openStream());
while(urlInput.hasNext())
{
...
}
}
}
Would it be viable to make a set OF MAPS? I was kind of thinking it would be better to make a set of maps because sets expand as needed (according to the load factor). I just need some general guidance. Unfortunately, the CS program at my university (Francis Marion University in Florence, SC) is VERY small and we don't have any tutors for this stuff.
This answer is rather vague, because of the broad nature of the question, and it may be more suitable for the Programmers SE site. Still, you may find these two points worth something.
Instead of thinking in terms of 'raw' compound collections, such as lists of maps of sets, try to invent a set of domain types that reflects your problem domain, and, as the next step, implement these types using suitable Java collections or arrays.
Unit testing and incremental refinement. Instead of immediately starting with access to remote data (via java.net.URL), start with a static source of data. The idea here is to have 'reliable' and easily accessible input data at hand, which allows you to write unit tests easily and without access to the network or even the file system, using the set of domain types from the 1st point above. As you write unit tests, you can invent the necessary domain types/method names in the unit tests first, then implement those types/methods, then make the unit tests pass.
For example, you may start by writing the following unit test (I assume you know how to organize your Java project in your IDE so that unit test(s) can be run properly):
public class SingleFileProcessingTest {
private static String[] fileRawData;
@BeforeClass
public static void fillRawData() {
fileRawData = new String[2];
// values are from my head, resembling format from links you've posted
fileRawData[0] = "Jacob\t20000\tEmily\t19999";
fileRawData[1] = "Michael\t18000\tMadison\t17000";
}
@Test
public void test() {
Rankings rankings = new Rankings();
rankings.process(fileRawData);
assertEquals("Jacob", rankings.getTop().getName());
assertEquals("Madison", rankings.getScorerOfPosition(4).getName());
assertEquals(18000, rankings.getScoreOf("Michael"));
assertEquals(4, rankings.getSize());
}
}
Of course, this won't even compile -- you need to type in the code of the Rankings class, the code of the class returned by getTop() or getScorerOfPosition(int), and so on. After you make this compile, you'll need to make the test pass. But you get the main idea here -- domain types and incremental refinement. And easily verifiable code without dependencies on the file system or network. Just plain old Java objects (POJOs). Code for working with external data sources can be added later on, after you get your POJOs right and make the tests, which cover most of your use cases, pass.
UPDATE: Actually, I've mixed up levels of abstraction in the code above: a proper Rankings class should not process raw data; that is better done in a separate class, say, RankingsDataParser. With that, the unit test, renamed to RankingsProcessingTest, will be:
public class RankingsProcessingTest {
@Test
public void test() {
Rankings rankings = new Rankings();
rankings.addScorer(new Scorer("Jacob", 20000));
rankings.addScorer(new Scorer("Emily", 19999));
rankings.addScorer(new Scorer("Michael", 18000));
rankings.addScorer(new Scorer("Madison", 17000));
assertEquals("Jacob", rankings.getTop().getName());
// assertEquals("Madison", rankings.getScorerOfPosition(4).getName());
// implementation of getScorerOfPosition(int) left as exercise :)
assertEquals(18000, rankings.getScoreOf("Michael"));
assertEquals(4, rankings.getSize());
}
}
With the following initial implementation of Rankings and Scorer, this actually compiles and passes:
class Scorer {
private final String name;
private final int rank;
Scorer(String name, int rank) {
this.name = name;
this.rank = rank;
}
public String getName() {
return name;
}
public int getRank() {
return rank;
}
}
class Rankings {
private final HashMap<String, Scorer> scorerByName = new HashMap<>();
private Scorer topScorer;
public Scorer getTop() {
return topScorer;
}
public void addScorer(Scorer scorer) {
if (scorerByName.get(scorer.getName()) != null)
throw new IllegalArgumentException("This version does not support duplicate names of scorers!");
if (topScorer == null || scorer.getRank() > topScorer.getRank()) {
topScorer = scorer;
}
scorerByName.put(scorer.getName(), scorer);
}
public int getSize() {
return scorerByName.size();
}
public int getScoreOf(String scorerName) {
return scorerByName.get(scorerName).getRank();
}
}
And the unit test for parsing of the raw data will start with the following (how to download the raw data should be the responsibility of yet another class, to be developed and tested separately):
public class SingleFileProcessingTest {
private static String[] fileRawData;
@BeforeClass
public static void fillRawData() {
fileRawData = new String[2];
// values are from my head
fileRawData[0] = "Jacob\t20000\tEmily\t19999";
fileRawData[1] = "Michael\t18000\tMadison\t17000";
}
@Test
public void test() {
// uncomment, make compile, make pass
/*
RankingsDataParser parser = new RankingsDataParser();
parser.parse(fileRawData);
Rankings rankings = parser.getParsedRankings();
assertNotNull(rankings);
*/
}
}
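To make the commented-out part concrete, a minimal RankingsDataParser might look like this (a sketch assuming the tab-separated name/score format used in fillRawData()):
class RankingsDataParser {
    private final Rankings rankings = new Rankings();

    void parse(String[] rawLines) {
        for (String line : rawLines) {
            // each line alternates name and score: "Jacob\t20000\tEmily\t19999"
            String[] tokens = line.split("\t");
            for (int i = 0; i + 1 < tokens.length; i += 2) {
                rankings.addScorer(new Scorer(tokens[i], Integer.parseInt(tokens[i + 1])));
            }
        }
    }

    Rankings getParsedRankings() {
        return rankings;
    }
}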
I have an object like:
class House {
String name;
List<Door> doors;
}
What I want to do is transform a List<House> into a List<Door> containing all doors of all houses.
Is there a way to do this with Guava?
I tried Guava's Lists.transform function, but I only get a List<List<Door>> as a result.
If you really need to use a functional approach, you can do this using FluentIterable#transformAndConcat:
public static ImmutableList<Door> foo(List<House> houses) {
return FluentIterable
.from(houses)
.transformAndConcat(GetDoorsFunction.INSTANCE)
.toImmutableList();
}
private enum GetDoorsFunction implements Function<House, List<Door>> {
INSTANCE;
@Override
public List<Door> apply(House input) {
return input.getDoors();
}
}
FluentIterable.from(listOfHouses).transformAndConcat(doorFunction)
would do the job just fine.
You don't need Guava for it (assuming I understood you correctly):
final List<Door> doorsFromAllHouses = new ArrayList<>();
for (final House house : houses) {
doorsFromAllHouses.addAll(house.doors);
}
// and doorsFromAllHouses is full of all kinds of doors from various houses
Using Lists.transform on the input list of houses, with a transform function that gets all doors from a house, gives you exactly what it says: a list of *each house's doors* (which is precisely a List<List<Door>>).
More generally, what you want is a reduce / fold function instead of transform, which isn't implemented in Guava (see this issue), mostly because of Java's verbose syntax and the presence of the for-each loop, which is good enough. You'll be able to reduce in Java 8 (or you can already do this in any other mainstream language nowadays...). Pseudo-Java8-code:
List<Door> doors = reduce(
    houses,                 // collection to reduce over
    new ArrayList<Door>(),  // initial accumulator value
    (house, acc) -> acc.addAll(house.doors)); // reducing function
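For the record, real Java 8 code would use flatMap rather than an explicit reduce (a standard streams idiom, not part of the original answer):
List<Door> doors = houses.stream()
    .flatMap(house -> house.doors.stream()) // flatten each house's doors
    .collect(Collectors.toList());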