I have a HashMap with a large number of entries which is serialized. If I make a small change to the HashMap, do I have to overwrite the old file completely, or is there an alternative?
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class HashMapSerial {
    public static void main(String[] args) throws IOException {
        HashMap<String, Integer> hash = new HashMap<String, Integer>(100000);
        hash.put("hello", 1);
        hash.put("world", 2);
        // + (100000 - 2) more entries

        ObjectOutputStream s = new ObjectOutputStream(new FileOutputStream(new File("hash.out")));
        s.writeObject(hash); // write the hash map to file

        hash.put("hello", 10);
        s.writeObject(hash); // rewrite the whole hashmap again
    }
}
Since the change affects only the key "hello" and no other element, is it possible to update the serialized file just for "hello" instead of rewriting the whole hashmap again?
Use a DB, or with simple file IO keep track of how far you have written previously.
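As an illustration of the simple-file-IO idea above, one approach is to append each change to a small log file instead of re-serializing the whole map, and rebuild the map by replaying that log on load. This is only a minimal sketch; the class, method, and file names are made up for the example.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class IncrementalMapStore {

    // Append a single key/value change instead of rewriting the whole map.
    public static void appendChange(File log, String key, int value) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(log, true))) { // true = append mode
            out.writeUTF(key);
            out.writeInt(value);
        }
    }

    // Rebuild the map by replaying the log; later entries overwrite earlier ones.
    public static Map<String, Integer> load(File log) throws IOException {
        Map<String, Integer> map = new HashMap<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(log))) {
            while (true) {
                try {
                    map.put(in.readUTF(), in.readInt());
                } catch (EOFException eof) {
                    break; // end of log reached
                }
            }
        }
        return map;
    }
}

Periodically you could compact the log by writing out the current map once and starting a fresh log; that keeps reload time bounded.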
AFAIK, you can't do incremental saves with simple Java serialization.
You should instead use another system to store your data (such as a database).
Maybe it's overkill, but a NoSQL DB (Cassandra, for instance) would be simpler than trying to create your own system.
I was trying to insert Strings into a StringBuffer using the forEach method of a parallelStream() created from a Set collection.
The problem is that every time I execute the code, the final String (StringBuffer.toString()) has 1 less element than the total (a random element every time I try).
I also changed the StringBuffer to a StringBuilder and the parallelStream() to stream(), but it always has 1 less element.
I am using:
- Java version: java 1.8_121
- Server: Weblogic 12.2.1.2 (I don't think this is relevant to the problem)
- Spring boot 2.0.2.RELEASE (I don't think this is relevant to the problem)
NOTE: I use a Map to keep the PDFs I have to sign later in the process (in another HTTP request).
Map<String, ClientPdf> dataToEncript = new HashMap<>(); // the pdf name will be the key for this map (it is unique in the SQL query)
List<Client> listClients = // list of clients from database
Set<ClientPdf> clientsPdf = new HashSet<>();
for (Client client : listClients) {
clientsPdf.add(client.clientPdf()); // clientPdf() generates a new ClientPdf object, similar to the Client class but with fewer fields (essential for the Set)
}
log.debug("Generating documents");
clientsPdf.parallelStream().forEach(pdf -> {
// some code to generate pdf
log.debug("Inserting pdf: {}", pdf); // this log print, for example, 27.000 lines
dataToEncript.put(pdf.getPdfName(), pdf);
});
StringBuffer sb = new StringBuffer(); // StringBuffer or StringBuilder, the same problem
for (ClientPdf clientPdf : dataToEncript.values()) {
sb.append(clientPdf.getPdfName() + ";" + clientPdf.getRut() + "\n"); // appending all values of the map dataToEncript; it appends only 26.669 (1 less)
}
clientsPdf.parallelStream().forEach(pdf -> {
// ...
dataToEncript.put(pdf.getPdfName(), pdf);
});
dataToEncript is not a thread-safe data structure, so this is likely to cause ridiculous and weird bugs like the one you're observing
In general, using forEach is often a bad sign, and you should almost always be using a Collector or some other method. For example, here you should probably use
clientsPdf.parallelStream()
.collect(Collectors.toConcurrentMap(ClientPdf::getPdfName, pdf -> pdf));
to get a correct map out.
Even better, you could write
clientsPdf.parallelStream()
.map(clientPdf -> clientPdf.getPdfName() + ";" + clientPdf.getRut() + "\n")
.collect(Collectors.joining())
to get the final String out without any manual management of StringBuffer or the like.
Because HashMap is not thread-safe, as Wasserman mentioned above.
It may cause an inconsistency in the state of the HashMap if multiple threads access the same object and try to modify its structure.
Therefore, Hashtable, Collections.synchronizedMap, and ConcurrentHashMap exist for using a map in a multi-threaded environment (such as with parallelStream()).
You can simply rewrite the first line of your code as follows:
Map<String, ClientPdf> dataToEncript = Collections.synchronizedMap(new HashMap<>());
Now, you are supposed to get the correct result after rerunning your program.
BTW, neither Hashtable nor a synchronized map wrapper performs well; you can use ConcurrentHashMap instead to overcome this issue.
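For instance, reusing the variable name from the question (ConcurrentHashMap is safe to populate from parallelStream().forEach, though the Collector approach above is still cleaner):

// requires: import java.util.concurrent.ConcurrentHashMap;
Map<String, ClientPdf> dataToEncript = new ConcurrentHashMap<>();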
Good luck!
Let's say I have a Stream with elements of type String. I want to write each element in the stream to a separate file in some folder. I'm using the following set up.
stream.writeAsText(path).setParallelism(1);
How do I make this path dynamic? I even tried adding System.nanoTime() to the path to make it dynamic, but it still doesn't seem to work; everything gets written to a single file.
This sort of use case is explicitly supported in Flink by the Rolling File Sink with a custom bucketer, or the newer and preferred Streaming File Sink with a custom BucketAssigner and RollingPolicy.
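A minimal sketch of the StreamingFileSink route, assuming Flink 1.7+; the bucketing scheme below (one bucket per element hash) is only an illustration, and the sink still rolls part files per bucket rather than guaranteeing literally one file per record:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

StreamingFileSink<String> sink = StreamingFileSink
        .forRowFormat(new Path("/output/base"), new SimpleStringEncoder<String>("UTF-8"))
        .withBucketAssigner(new BucketAssigner<String, String>() {
            @Override
            public String getBucketId(String element, Context context) {
                // route each element to its own sub-directory ("bucket")
                return Integer.toHexString(element.hashCode());
            }

            @Override
            public SimpleVersionedSerializer<String> getSerializer() {
                return SimpleVersionedStringSerializer.INSTANCE;
            }
        })
        .build();

stream.addSink(sink);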
Your problem is that DataStream.writeAsText() writes the entire content of the stream to the file at once, so you will only ever get a single file.
It looks like the flatMap below will return a collection that you can use to output your strings as different files.
dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out)
            throws Exception {
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});
Taken straight from the documentation here: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html
I am trying to write a program that takes a huge data set and then runs some queries on it using MapReduce. I have code like this:
// imports needed: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
// org.apache.hadoop.fs.Path, org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper, java.io.*
public static class MRMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private long max = 0;

    private BufferedWriter out;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // open a side file in HDFS, separate from the job output
        String output2 = "hdfs://master:9000/user/xxxx/indexln.txt";
        FileSystem phdfs = FileSystem.get(context.getConfiguration());
        Path fname1 = new Path(output2);
        out = new BufferedWriter(new OutputStreamWriter(phdfs.create(fname1, true)));
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String binln = Long.toBinaryString(0x8000000000000000L | key.get()).substring(1);
        out.write(binln + "\n");
        out.flush();

        String line = value.toString();
        String[] st = line.split(",");
        long val = Math.abs(Long.parseLong(st[2]));
        if (max < val) {
            max = val;
        } else {
            word.set(line);
            context.write(word, new IntWritable((int) val));
        }
    }
}
What I am trying to do is build an index file in the mapper, which would be used by the mappers to access specific areas of the input file. The mappers read a part of the input file based on the index and then print the part read and the number of lines read to the output. I am using one mapper with 9 reducers.
My question is: is it possible to create/write to a file different from the output file in the map function, and can a reducer read a file that is open in the mapper? If yes, am I on the right path, or totally wrong, or maybe MapReduce is not the way to do this? I apologize if this question sounds too noob, but I'm actually a noob in Hadoop and trying to learn. Thanks.
Are you sure you are using a single mapper? Because Hadoop creates a number of mappers very close to the number of input splits (more details).
The concept of an input split is very important as well: it means very big data files are split into several chunks, each chunk assigned to a mapper. Thus, unless you are totally sure only one mapper is being used, you won't be able to control which part of the file you are working on, and you will not be able to control any kind of global index.
That being said, using a single mapper in MapReduce is the same as not using MapReduce at all :) Maybe the mistake is mine, and I'm assuming you have only one file to be analyzed; is that the case?
In case you have several big data files the scenario changes, and it could make sense to create a single mapper for each file, but you will have to create your own InputSplit and override the isSplitable method by always returning false.
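A minimal sketch of that override, assuming the new org.apache.hadoop.mapreduce API and text input (the class name is made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each input file becomes exactly one split, and therefore one mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

You would then register it on the job with job.setInputFormatClass(WholeFileTextInputFormat.class).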
I have offline JSON definitions (in the assets folder), and with them I create my data model. It has about 8 classes which all inherit from (extend) one abstract Model class.
Would it be a better solution to parse the JSON and keep the model in memory (more or less everything is an Integer or String) through the whole life cycle of the app, or would it be smarter to parse the JSON files as they are needed?
thanks
Parsing the files and storing all the data in memory will definitely give you a speed advantage. The problem with this solution is that if your application goes to the background (the user receives a phone call or just leaves the app of his own will), no one can guarantee that the data will stay intact in memory.
This data can be cleared by the GC if the system decides that it needs more memory.
This means that when the user comes back to the application, if you rely on the fact that the data is in memory, you might face an exception. So you need to consider this situation.
From that point of view it is good to store your data in a file that can be parsed at a desired time, even though this might be a slower solution.
Another solution you may look at is to parse this data at first application start-up into an SQLite DB and use it from there, or even store it in the DB in the first place. This gives you the advantages of both worlds: you do not have to parse the data before using it, you have quick access to it using a Cursor, and you do not face the problem of data deletion in case of insufficient memory in the system.
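If you go the SQLite route, a rough sketch of the idea using the standard SQLiteOpenHelper API (the class, table, and column names here are made up for illustration; a real schema would mirror your model classes):

import android.content.ContentValues;
import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

public class ModelDbHelper extends SQLiteOpenHelper {

    public ModelDbHelper(Context context) {
        super(context, "models.db", null, 1);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        // one simple key/value table for the parsed JSON values
        db.execSQL("CREATE TABLE model (model_key TEXT PRIMARY KEY, model_value TEXT)");
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        db.execSQL("DROP TABLE IF EXISTS model");
        onCreate(db);
    }

    // called once after parsing the JSON at first start-up
    public void insert(String key, String value) {
        ContentValues values = new ContentValues();
        values.put("model_key", key);
        values.put("model_value", value);
        getWritableDatabase().insert("model", null, values);
    }
}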
I'd read all the file content at once and keep it as a static String somewhere in my application that is available to all application components (Singleton pattern), since maintaining a small string in memory is usually much cheaper than opening and closing files frequently.
To address the GC point @Emil pointed out, you can write your code something like this:
import android.content.Context;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DataManager {

    private static String myData;

    public static String getData(Context context) {
        if (myData == null) {
            loadData(context);
        }
        return myData;
    }

    private static void loadData(Context context) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(context.getAssets().open("data.txt"), "UTF-8"));
            StringBuilder builder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line);
            }
            reader.close();
            myData = builder.toString();
        } catch (IOException e) {
            // handle/log the error as appropriate for your app
        }
    }
}
And from any class in your application that has a valid Context reference:
String data = DataManager.getData(context);
I have a Hashtable<String, String>; in my program I want to record the values of the Hashtable to process later.
My question is: can we write a Hashtable object to a file? If so, how can we later load that file?
Yes, using binary serialization (ObjectOutputStream):
FileOutputStream fos = new FileOutputStream("t.tmp");
ObjectOutputStream oos = new ObjectOutputStream(fos);
oos.writeObject(yourHashTable);
oos.close();
Then you can read it back using ObjectInputStream.
The objects that you put inside the Hashtable (or better, a HashMap) have to implement Serializable.
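Reading it back is symmetric; a minimal sketch (readObject also throws ClassNotFoundException, and the unchecked cast is unavoidable here):

FileInputStream fis = new FileInputStream("t.tmp");
ObjectInputStream ois = new ObjectInputStream(fis);
@SuppressWarnings("unchecked")
Hashtable<String, String> restored = (Hashtable<String, String>) ois.readObject();
ois.close();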
If you want to store the Hashtable in a human-readable format, you can use java.beans.XMLEncoder:
FileOutputStream fos = new FileOutputStream("tmp.xml");
XMLEncoder e = new XMLEncoder(fos);
e.writeObject(yourHashTable);
e.close();
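And reading it back with the matching java.beans.XMLDecoder (a small sketch, assuming the same file name):

FileInputStream fis = new FileInputStream("tmp.xml");
XMLDecoder d = new XMLDecoder(fis);
Hashtable<String, String> restored = (Hashtable<String, String>) d.readObject();
d.close();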
Don't know about your specific application, but you might want to have a look at the Properties class. (It extends Hashtable.)
This class provides you with
void load(InputStream inStream)
Reads a property list (key and element pairs) from the input byte stream.
void load(Reader reader)
Reads a property list (key and element pairs) from the input character stream in a simple line-oriented format.
void loadFromXML(InputStream in)
Loads all of the properties represented by the XML document on the specified input stream into this properties table.
void store(Writer writer, String comments)
Writes this property list (key and element pairs) in this Properties table to the output character stream in a format suitable for using the load(Reader) method.
void storeToXML(OutputStream os, String comment)
Emits an XML document representing all of the properties contained in this table.
The tutorial is quite educational also.
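For example, a minimal round trip with Properties (the file name is chosen arbitrarily here):

import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Properties;

public class PropertiesDemo {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("hello", "1");
        props.setProperty("world", "2");

        // write the key/value pairs to disk
        try (FileWriter out = new FileWriter("table.properties")) {
            props.store(out, "saved table");
        }

        // ... later, load them back
        Properties loaded = new Properties();
        try (FileReader in = new FileReader("table.properties")) {
            loaded.load(in);
        }
        System.out.println(loaded.getProperty("hello")); // prints 1
    }
}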
If you want to be able to easily edit the map once it's written out, you might want to take a look at jYaml. It allows you to easily write the map to a Yaml-formatted file, meaning it's easy to read and edit.
You could also use MapDB and it will save the HashMap for you after you do a put and a commit.
That way, if the program crashes, the values will still be persisted.
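A rough sketch of what that looks like, assuming MapDB 3.x's DBMaker API (the file and map names are arbitrary):

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

import java.util.concurrent.ConcurrentMap;

public class MapDbDemo {
    public static void main(String[] args) {
        // transactionEnable() gives you commit()/rollback() semantics
        DB db = DBMaker.fileDB("table.db").transactionEnable().make();

        ConcurrentMap<String, String> map = db
                .hashMap("myTable", Serializer.STRING, Serializer.STRING)
                .createOrOpen();

        map.put("key", "value");
        db.commit(); // persists the change to disk
        db.close();
    }
}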