Java: storing a big map in resources

I need to use a big file that contains String,String pairs and because I want to ship it with a JAR, I opted to include a serialized and gzipped version in the resource folder of the application. This is how I created the serialization:
ObjectOutputStream out = new ObjectOutputStream(
new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(OUT_FILE_PATH, false))));
out.writeObject(map);
out.close();
I chose to use a HashMap<String,String>; the resulting file is 60MB and the map contains about 4 million entries.
Now when I need the map and I deserialize it using:
final InputStream in = FileUtils.getResource("map.ser.gz");
final ObjectInputStream ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(in)));
map = (Map<String, String>) ois.readObject();
ois.close();
this takes about 10~15 seconds. Is there a better way to store such a big map in a JAR? I ask because I also use the Stanford CoreNLP library which uses big model files itself but seems to perform better in that regard. I tried to locate the code where the model files are read but gave up.

Your problem is that you zipped the data. Store it as plain text.
The performance hit is most probably in unzipping the stream. JARs are already zipped, so there is no space saving from storing the file zipped.
Basically:
Store the file in plain text
Use Files.lines(Paths.get("myfilename.txt")) to stream the lines
Consume each line with minimal code
Something like this, assuming data is in form key=value (like a Properties file):
Map<String, String> map = new HashMap<>();
// try-with-resources closes the underlying file once the stream is consumed
try (Stream<String> lines = Files.lines(Paths.get("myfilename.txt"))) {
    lines.map(s -> s.split("=", 2)) // limit 2 keeps any '=' inside the value intact
         .forEach(a -> map.put(a[0], a[1]));
}
Disclaimer: Code may not compile or work as it was thumbed in on my phone (but there's a reasonable chance it will work)
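A minimal sketch of the plain-text approach for a file that ships inside the JAR. Files.lines(Paths.get(...)) resolves against the default file system, so it will not find an entry packed inside a JAR; the sketch reads the resource through getResourceAsStream instead. The class name TextMapLoader, the resource name passed to loadResource and the expectedEntries parameter are assumptions made for illustration, and whether this beats the gzipped serialized form should be measured on the real data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original answer.
final class TextMapLoader {

    // Parses "key=value" lines; expectedEntries pre-sizes the table to avoid rehashing.
    static Map<String, String> parse(Reader source, int expectedEntries) throws IOException {
        Map<String, String> map = new HashMap<>((int) (expectedEntries / 0.75f) + 1);
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int sep = line.indexOf('=');
                if (sep < 0) {
                    continue; // skip lines without a separator
                }
                // Split at the first '=' only, so values may contain '='.
                map.put(line.substring(0, sep), line.substring(sep + 1));
            }
        }
        return map;
    }

    // Reads a classpath resource, which also works when the file sits inside a JAR.
    static Map<String, String> loadResource(String resourceName, int expectedEntries) throws IOException {
        InputStream in = TextMapLoader.class.getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("Resource not found: " + resourceName);
        }
        return parse(new InputStreamReader(in, StandardCharsets.UTF_8), expectedEntries);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> map = parse(new StringReader("a=1\nb=x=y\n"), 2);
        System.out.println(map.get("a") + " " + map.get("b")); // prints "1 x=y"
    }
}
```

In the application this would be called as something like TextMapLoader.loadResource("/map.txt", 4_000_000), where "/map.txt" is a placeholder for the real resource path.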

You can apply a technique from the book Java Performance: The Definitive Guide by Scott Oaks: store the zipped content of the object in a byte array. This needs a wrapper class, which I call MapHolder here:
public class MapHolder implements Serializable {
// This will contain the zipped content of my map
private byte[] content;
// My actual map defined as transient as I don't want to serialize its
// content but its zipped content
private transient Map<String, String> map;
public MapHolder(Map<String, String> map) {
this.map = map;
}
private void writeObject(ObjectOutputStream out) throws IOException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (GZIPOutputStream zip = new GZIPOutputStream(baos);
ObjectOutputStream oos = new ObjectOutputStream(
new BufferedOutputStream(zip))) {
oos.writeObject(map);
}
this.content = baos.toByteArray();
out.defaultWriteObject();
// Clear the temporary field content
this.content = null;
}
private void readObject(ObjectInputStream in) throws IOException,
ClassNotFoundException {
in.defaultReadObject();
try (ByteArrayInputStream bais = new ByteArrayInputStream(content);
GZIPInputStream zip = new GZIPInputStream(bais);
ObjectInputStream ois = new ObjectInputStream(
new BufferedInputStream(zip))) {
this.map = (Map<String, String>) ois.readObject();
// Clean the temporary field content
this.content = null;
}
}
public Map<String, String> getMap() {
return this.map;
}
}
Your code will then simply be:
final ByteArrayInputStream in = new ByteArrayInputStream(
Files.readAllBytes(Paths.get("/tmp/map.ser"))
);
final ObjectInputStream ois = new ObjectInputStream(in);
MapHolder holder = (MapHolder) ois.readObject();
map = holder.getMap();
ois.close();
As you may have noticed, you no longer zip the stream yourself; the content is zipped internally when the MapHolder instance is serialized.

You could consider one of the many fast serialization libraries:
Protocol Buffers (https://github.com/google/protobuf)
FlatBuffers (https://google.github.io/flatbuffers/)
Cap'n Proto (https://capnproto.org)
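None of the libraries listed above is used in the following sketch. It is a plain-JDK stand-in for the same idea: a compact binary layout read with DataInputStream, which avoids the per-object bookkeeping of ObjectInputStream. The class name BinaryMapCodec and the layout (an entry count followed by key/value pairs) are assumptions made for illustration, and any speed-up over Java serialization should be measured rather than assumed.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical codec, not taken from any of the libraries named above.
final class BinaryMapCodec {

    // Layout: int entry count, then writeUTF(key) and writeUTF(value) per entry.
    // writeUTF rejects strings whose encoded form exceeds 65535 bytes.
    static void write(Map<String, String> map, OutputStream target) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(target))) {
            out.writeInt(map.size());
            for (Map.Entry<String, String> entry : map.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
        }
    }

    static Map<String, String> read(InputStream source) throws IOException {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source))) {
            int size = in.readInt();
            // Pre-size the table so the map is not rehashed while it is filled.
            Map<String, String> map = new HashMap<>((int) (size / 0.75f) + 1);
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                map.put(key, value);
            }
            return map;
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> original = new HashMap<>();
        original.put("k1", "v1");
        original.put("k2", "v2");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(original, buffer);
        Map<String, String> copy = read(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(copy.equals(original)); // prints "true"
    }
}
```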

Related

Strings in downloadfile weird symbols

I've got a String array that contains the content for a downloadable file. I am converting it to a Stream for the download but there are some random values in the downloadfile. I don't know if it is due to the encoding and if yes, how can I change it?
var downloadButton = new DownloadLink(btn, "test.csv", () -> {
try {
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
ObjectOutputStream objectOutputStream = new ObjectOutputStream(byteArrayOutputStream);
for (int i = 0; i < downloadContent.size(); i++) {
objectOutputStream.writeUTF(downloadContent.get(i));
}
objectOutputStream.flush();
objectOutputStream.close();
byte[] byteArray = byteArrayOutputStream.toByteArray();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(byteArray);
ObjectInputStream objectInputStream = new ObjectInputStream(byteArrayInputStream);
objectInputStream.close();
return new ByteArrayInputStream(byteArray);
This is the DownloadLink class.
public class DownloadLink extends Anchor {
public DownloadLink(Button button, String fileName, InputStreamFactory fileProvider) {
super(new StreamResource(fileName, fileProvider), "");
getElement().setAttribute("download", fileName);
add(button);
getStyle().set("display", "contents");
}
}
this is the output file

ObjectOutputStream is part of the Java serialization system. In addition to the data itself, it also includes metadata about the original Java types and such. It's only intended for writing data that will later be read back using ObjectInputStream.
To create a file for others to download, you could instead use a PrintWriter that wraps the original output stream. On the other hand, you're using the output stream to create a byte[] which means that a more straightforward, but slightly less efficient, way would be to create a concatenated string from all the array elements and then use getBytes(StandardCharsets.UTF_8) on it to directly get a byte array.
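The second suggestion can be sketched as follows: join the strings and encode them as UTF-8, so the resulting bytes contain only the text and no serialization header. The class name CsvDownload and the newline separator are assumptions; downloadContent is taken to be a List<String>, as the question's use of size() and get(i) suggests.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the getBytes(StandardCharsets.UTF_8) route.
final class CsvDownload {

    // Joins the lines with '\n' and encodes the result as UTF-8.
    static byte[] toUtf8Bytes(List<String> lines) {
        return String.join("\n", lines).getBytes(StandardCharsets.UTF_8);
    }

    // Wraps the encoded text in a stream, the shape a download callback typically returns.
    static InputStream toStream(List<String> lines) {
        return new ByteArrayInputStream(toUtf8Bytes(lines));
    }

    public static void main(String[] args) {
        List<String> downloadContent = Arrays.asList("id,name", "1,Alice");
        byte[] bytes = toUtf8Bytes(downloadContent);
        // Prints the two lines exactly as they were supplied.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Inside the question's lambda, the whole ObjectOutputStream section could then shrink to a single return of CsvDownload.toStream(downloadContent).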

Deserialize Avro Data from bytes

I am trying to deserialize, i.e., get an object of class org.apache.avro.generic.GenericRecord from byte array Avro data. This data contains a header with the full schema.
So far, I have tried this:
public List<GenericRecord> deserializeGenericWithSchema(byte[] message) throws IOException {
List<GenericRecord> listOfRecords = new ArrayList<>();
DatumReader<GenericRecord> reader = new GenericDatumReader<>();
DataFileReader<GenericRecord> fileReader =
new DataFileReader<>(new SeekableByteArrayInput(message), reader);
GenericRecord record = null;
while (fileReader.hasNext()) {
listOfRecords.add(fileReader.next(record));
}
return listOfRecords;
}
But I am getting an error:
java.io.IOException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
at org.apache.avro.io.BinaryDecoder.readBytes(BinaryDecoder.java:282)
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:112)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
However, if I write to disk the byte array message and change my function like:
public List<GenericRecord> deserializeGenericWithSchema(String fileName) throws IOException {
File file = new File(fileName);
List<GenericRecord> listOfRecords = new ArrayList<>();
DatumReader<GenericRecord> reader = new GenericDatumReader<>();
DataFileReader<GenericRecord> fileReader =
new DataFileReader<>(file, reader);
GenericRecord record = null;
while (fileReader.hasNext()) {
listOfRecords.add(fileReader.next(record));
}
return listOfRecords;
}
It works flawlessly. I really don't want to write every Avro message I get to disk, because this is intended to work in real time.
What am I doing wrong in my first approach?

Do you have any follow-up on the issue? My assumption is that it is an encoding issue. Where did the byte[] come from? Is it the exact byte[] you are writing to disk? The explanation may lie in the default encoding settings of the file writer and reader.

Java: Hashmap with contents compiled

I am looking to implement a HashMap with its contents in the bytecode. This would be similar to me serializing the content and then reading it in. But in my experience serialization only works with saving it to a file and then reading it in, I would want this implementation to be faster than that.

"But in my experience serialization only works with saving it to a file and then reading it in, I would want this implementation to be faster than that."
Serialization works with streams. Specifically, ObjectOutputStream can wrap any OutputStream. If you want to perform in-memory serialization, you could use ByteArrayOutputStream here.
Similarly on the input side.

You can save your HashMap as a byte array using the Java serialization mechanism:
Map map = new HashMap();
map.put(1, 1);
ByteArrayOutputStream bout = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(bout);
oos.writeObject(map);
oos.close();
byte[] bytes = bout.toByteArray();
// restore from bytes
ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
map = (Map) ois.readObject();
System.out.println(map);
Output:
{1=1}
Note that both keys and values in the Map must be Serializable, otherwise it won't work.

How to test a Weka Text Classification (FilteredClassifier)

Looked at lots of examples for this, and so far no luck. I'd like to classify free text.
Configure a text classifier. (FilteredClassifier using StringToWordVector and LibSVM)
Train the classifier (add in lots of documents, train on filtered text)
Serialize the FilteredClassifier to disk, quit the app
Then later
Load up the serialized FilteredClassifier
Classify stuff!
It goes ok up to when I try to read from disk and classify things. All the documents and examples show the training list and testing list being built at the same time, and in my case, I'm trying to build a testing list after the fact.
A FilteredClassifier alone is not enough to create a testing Instance with the same "dictionary" as the original training set, so how do I save everything I need to classify at a later date?
http://weka.wikispaces.com/Use+WEKA+in+your+Java+code just says "Instances loaded from somewhere" and doesn't say anything about using a similar dictionary.
ClassifierFramework cf = new WekaSVM();
if (!cf.isTrained()) {
train(cf); // Train, save to disk
cf = new WekaSVM(); // reloads from file
}
cf.test("this is a test");
Ends up throwing
java.lang.ArrayIndexOutOfBoundsException: 2
at weka.core.DenseInstance.value(DenseInstance.java:332)
at weka.filters.unsupervised.attribute.StringToWordVector.convertInstancewoDocNorm(StringToWordVector.java:1587)
at weka.filters.unsupervised.attribute.StringToWordVector.input(StringToWordVector.java:688)
at weka.classifiers.meta.FilteredClassifier.filterInstance(FilteredClassifier.java:465)
at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:495)
at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:70)
at ratchetclassify.lab.WekaSVM.test(WekaSVM.java:125)

Serialize the header of your training Instances (the attribute definitions, which act as the "dictionary") together with your classifier:
Instances trainInstances = ... //
Instances trainHeader = new Instances(trainInstances, 0); // same attributes, zero rows
trainHeader.setClassIndex(trainInstances.classIndex());
OutputStream os = new FileOutputStream(fileName);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(os);
objectOutputStream.writeObject(classifier);
if (trainHeader != null)
objectOutputStream.writeObject(trainHeader);
objectOutputStream.flush();
objectOutputStream.close();
To deserialize:
Classifier classifier = null;
Instances trainHeader = null;
InputStream is = new BufferedInputStream(new FileInputStream(fileName));
ObjectInputStream objectInputStream = new ObjectInputStream(is);
classifier = (Classifier) objectInputStream.readObject();
try { // see if we can load the header
trainHeader = (Instances) objectInputStream.readObject();
} catch (Exception e) {
// no header was stored in this file; trainHeader stays null
}
objectInputStream.close();
Use trainHeader to create a new Instance:
int numAttributes = trainHeader.numAttributes();
double[] vals = new double[numAttributes];
// Fill every attribute except the last one, which is assumed to be the class attribute
for (int i = 0; i < numAttributes - 1; i++) {
Attribute attribute = trainHeader.attribute(i);
if (attribute.isNominal() || attribute.isString()) {
vals[i] = attribute.indexOfValue(myStrVal); // get myStrVal from your source
} else {
vals[i] = myNumericVal; // get myNumericVal from your source
}
}
// The class value is unknown; the last valid index is numAttributes - 1
// Weka 3.7+ API; on Weka 3.6 use Instance.missingValue() and new Instance(1.0, vals)
vals[numAttributes - 1] = Utils.missingValue();
Instance instance = new DenseInstance(1.0, vals);
instance.setDataset(trainHeader);
return instance;

How to read and write a HashMap to a file?

I have the following HashMap:
HashMap<String,Object> fileObj = new HashMap<String,Object>();
ArrayList<String> cols = new ArrayList<String>();
cols.add("a");
cols.add("b");
cols.add("c");
fileObj.put("mylist",cols);
I write it to a file as follows:
File file = new File("temp");
FileOutputStream f = new FileOutputStream(file);
ObjectOutputStream s = new ObjectOutputStream(f);
s.writeObject(fileObj);
s.flush();
Now I want to read this file back into a HashMap where the Object is an ArrayList.
If I simply do:
File file = new File("temp");
FileInputStream f = new FileInputStream(file);
ObjectInputStream s = new ObjectInputStream(f);
fileObj = (HashMap<String,Object>)s.readObject();
s.close();
This does not give me the object in the format that I saved it in.
It returns a table with 15 null elements and the <mylist,[a,b,c]> pair at the 3rd element. I want it to return only one element with the values I had provided to it in the first place.
How can I read the same object back into a HashMap?
OK, so based on Cem's note, this seems to be the correct explanation:
ObjectOutputStream serializes objects (a HashMap in this case) in whatever format ObjectInputStream understands for deserialization, and it does so generically for any Serializable object.
If you want a specific serialized format, you should write your own serializer/deserializer.
In my case, I simply iterate over the elements of the HashMap after reading the object back from the file and do whatever I want with the data (the loop body runs only for entries that actually hold data).

You appear to be confusing the internal representation of a HashMap with how the HashMap behaves. The collections are the same. Here is a simple test to prove it to you.
public static void main(String... args)
throws IOException, ClassNotFoundException {
HashMap<String, Object> fileObj = new HashMap<String, Object>();
ArrayList<String> cols = new ArrayList<String>();
cols.add("a");
cols.add("b");
cols.add("c");
fileObj.put("mylist", cols);
{
File file = new File("temp");
FileOutputStream f = new FileOutputStream(file);
ObjectOutputStream s = new ObjectOutputStream(f);
s.writeObject(fileObj);
s.close();
}
File file = new File("temp");
FileInputStream f = new FileInputStream(file);
ObjectInputStream s = new ObjectInputStream(f);
HashMap<String, Object> fileObj2 = (HashMap<String, Object>) s.readObject();
s.close();
Assert.assertEquals(fileObj.hashCode(), fileObj2.hashCode());
Assert.assertEquals(fileObj.toString(), fileObj2.toString());
Assert.assertTrue(fileObj.equals(fileObj2));
}

I believe you're making a common mistake. You forgot to close the stream after using it!
File file = new File("temp");
FileOutputStream f = new FileOutputStream(file);
ObjectOutputStream s = new ObjectOutputStream(f);
s.writeObject(fileObj);
s.close();
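One way to make sure the stream is always closed is try-with-resources, which closes (and therefore flushes) the stream even when writeObject throws. The following is a minimal sketch under that assumption; the class name MapFileRoundTrip and the use of a temporary file are illustrative choices, not part of the question.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;

// Hypothetical helper showing save and load with automatic stream closing.
final class MapFileRoundTrip {

    static void save(HashMap<String, Object> map, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(map);
        } // the stream is closed here, which also flushes buffered bytes
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, Object> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (HashMap<String, Object>) in.readObject();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        HashMap<String, Object> fileObj = new HashMap<>();
        fileObj.put("mylist", new ArrayList<>(Arrays.asList("a", "b", "c")));
        Path file = Files.createTempFile("map", ".ser");
        try {
            save(fileObj, file);
            System.out.println(load(file).equals(fileObj)); // prints "true"
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```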

You can also use a JSON file to read and write a Map object.
To write a Map object to a JSON file:
ObjectMapper mapper = new ObjectMapper();
Map<String, Object> map = new HashMap<String, Object>();
map.put("name", "Suson");
map.put("age", 26);
// write JSON to a file
mapper.writeValue(new File("c:\\myData.json"), map);
To read a Map object from a JSON file:
ObjectMapper mapper = new ObjectMapper();
// read JSON from a file
Map<String, Object> map = mapper.readValue(
new File("c:\\myData.json"),
new TypeReference<Map<String, Object>>() {
});
System.out.println(map.get("name"));
System.out.println(map.get("age"));
Import ObjectMapper from com.fasterxml.jackson.databind and wrap the code in a try/catch block.

Your first line:
HashMap<String,Object> fileObj = new HashMap<String,Object>();
gave me pause, as the values are not guaranteed to be Serializable and thus may not be written out correctly. You should really define the object as a HashMap<String, Serializable> (or, if you prefer, simply Map<String, Serializable>).
I would also consider serializing the Map in a simple text format such as JSON since you are doing a simple String -> List<String> mapping.

I believe you're getting what you're saving. Have you inspected the map before you save it? In HashMap:
/**
* The default initial capacity - MUST be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 16;
That is, the default HashMap starts off with a table of 16 slots, all null. You use one of the buckets, so only 15 nulls are left when you save, which is what you get when you load.
Try inspecting fileObj.keySet(), .entrySet() or .values() to see what you expect.
HashMaps are designed to be fast while trading off memory. See Wikipedia's Hash table entry for more details.
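A minimal in-memory sketch of that inspection: after a serialization round trip, the map's public view (size(), keySet(), get()) shows exactly one entry, whatever a debugger displays for the internal table. The class name LogicalViewDemo is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;

// Hypothetical demo class; it inspects the map through its public API only.
final class LogicalViewDemo {

    @SuppressWarnings("unchecked")
    static HashMap<String, Object> roundTrip(HashMap<String, Object> map)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(map);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return (HashMap<String, Object>) in.readObject();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        HashMap<String, Object> fileObj = new HashMap<>();
        fileObj.put("mylist", new ArrayList<>(Arrays.asList("a", "b", "c")));
        HashMap<String, Object> restored = roundTrip(fileObj);
        System.out.println(restored.size());        // prints "1"
        System.out.println(restored.keySet());      // prints "[mylist]"
        System.out.println(restored.get("mylist")); // prints "[a, b, c]"
    }
}
```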

The same data, if you want to write it to a text file instead:
public void writeToFile(Map<String, List<String>> failureMessage) {
    if (file != null) { // file is assumed to be a java.io.File field of the enclosing class
        // try-with-resources closes the writer even if a write fails
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(file, true))) {
            for (Map.Entry<String, List<String>> entry : failureMessage.entrySet()) {
                writer.write(entry.getKey() + "\n");
                for (String message : entry.getValue()) {
                    writer.write(message + "\n");
                }
                writer.write("\n"); // blank line separates one key's block from the next
            }
        } catch (IOException e) {
            System.out.println("Unable to write to file: " + file.getPath());
            e.printStackTrace();
        }
    }
}
