I have the following (pretty nasty) code snippet, which generates an MD5 hash over the content of an item:
protected String createHashFromContentNew() throws CrawlerException {
    final StringBuilder builder = new StringBuilder();
    if (getContent() != null) {
        builder.append(new String(getContent()));
    }
    if (builder.length() == 0) {
        throw new CrawlerException(hashErrorMessage("the content of this item is empty!"));
    } else {
        return MD5Utils.generateMD5Hash(builder.toString());
    }
}
The MD5Utils.generateMD5Hash(...) method can also be called with an InputStream.
getContent() returns a byte[].
This actually worked fine until I got items with huge contents. Since this is used in a multi-threaded environment, it uses up a lot of RAM by holding the content multiple times.
I now want to call generateMD5Hash() with an InputStream, to stop loading everything into RAM. The problem is that the resulting hash must be the same as the one the current function produced for all previously generated hashes.
Any ideas how to achieve that in a proper way?
Maybe you want ByteArrayInputStream?
Have a look here.
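If the stream-based overload of MD5Utils.generateMD5Hash is available, a minimal sketch could look like the following. One caveat, loudly flagged: the original code round-trips the bytes through new String(getContent()), so the stream-based hash only matches the old one if that round-trip through the platform default charset is lossless for your content and the String overload uses the same charset internally; verify that against MD5Utils before trusting the hashes to line up.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

protected String createHashFromContentStreaming() throws CrawlerException {
    final byte[] content = getContent();
    if (content == null || content.length == 0) {
        throw new CrawlerException(hashErrorMessage("the content of this item is empty!"));
    }
    // Wrap the existing byte[]; no intermediate String copy is created.
    try (InputStream in = new ByteArrayInputStream(content)) {
        return MD5Utils.generateMD5Hash(in);
    } catch (IOException e) {
        throw new CrawlerException(hashErrorMessage(e.getMessage()));
    }
}

Note that ByteArrayInputStream alone does not shrink the memory footprint, since getContent() still holds the whole array; the real saving only comes once getContent() itself is replaced by a stream over the underlying source.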
I'm trying to translate the following method from C# into Java.
Some components are easily recognizable; for instance (please correct me if I'm wrong, but) it seems that:
C#                  Java
Console.WriteLine   System.out.println
Some components are more opaque, such as using. I guess that has no equivalent in Java, does it? So I'm thinking I'll just ignore it; is that prudent?
A little background before we go on, I'm trying to decode a google protocol buffer .pb file.
Serializer.TryReadLengthPrefix(file, PrefixStyle.Base128, out len) is doubtless tricky as well, but it's the whole crux of the program, so it's important.
I'm reasonably certain that in place of that I should use something like this:
while ((r = Relation.parseDelimitedFrom(is)) != null) {
    RelationAndMentions relation = new RelationAndMentions(
            r.getRelType(), r.getSourceGuid(), r.getDestGuid());
    labelCountHisto.incrementCount(relation.posLabels.size());
    relTypes.addAll(relation.posLabels);
    relations.add(relation);
    for (int i = 0; i < r.getMentionCount(); i++) {
        DocumentProtos.Relation.RelationMentionRef mention = r.getMention(i);
        // String s = mention.getSentence();
        relation.mentions.add(new Mention(mention.getFeatureList()));
    }
    for (String l : relation.posLabels) {
        addKnownRelation(relation.arg1, relation.arg2, l, knownRelationsPerEntity);
    }
}
But that's an unwieldy beast and I'm not sure exactly what to do with it.
I've been at this too long and my capacity to think clearly is totally dissipated, but if one among you who is an expert in C# and Java feels up to this momentous undertaking, far be it from me to stop you.
static void ProcessFile(string path)
{
    try
    {
        Console.WriteLine("Processing: {0}", path);
        using (var file = File.OpenRead(path))
        {
            int len, count = 0;
            while (Serializer.TryReadLengthPrefix(file, PrefixStyle.Base128, out len))
            {
                Console.WriteLine("Fragment: {0} bytes", len);
                using (var reader = new ProtoReader(file, null, null, len))
                {
                    ProcessRelation(reader);
                    count++;
                }
            }
            Console.WriteLine("{0}, {1} Relation objects parsed", path, count);
            Console.Error.WriteLine("{0}, {1} Relation objects parsed", path, count);
        }
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine(ex.Message);
    }
    finally
    {
        Console.WriteLine();
    }
}
If you're feeling particularly ambitious, please do dig into the whole code here.
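Not an authoritative translation, but here is a sketch of how ProcessFile might look in Java, assuming the protoc-generated DocumentProtos.Relation class from the snippet above and a processRelation helper standing in for ProcessRelation. Java 7's try-with-resources plays the role of C#'s using, and parseDelimitedFrom replaces the TryReadLengthPrefix/ProtoReader pair, since protobuf-net's PrefixStyle.Base128 writes the same varint length prefix that parseDelimitedFrom expects:

import java.io.FileInputStream;
import java.io.InputStream;

static void processFile(String path) {
    System.out.printf("Processing: %s%n", path);
    int count = 0;
    try (InputStream file = new FileInputStream(path)) {
        DocumentProtos.Relation r;
        // parseDelimitedFrom reads one varint-length-prefixed message per
        // call and returns null at end of stream.
        while ((r = DocumentProtos.Relation.parseDelimitedFrom(file)) != null) {
            processRelation(r);
            count++;
        }
        System.out.printf("%s, %d Relation objects parsed%n", path, count);
    } catch (Exception ex) {
        System.err.println(ex.getMessage());
    } finally {
        System.out.println();
    }
}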
Can you please elaborate more on
I'm trying to decode a google protocol buffer .pb file.
Usually you have a protobuf definition file. Google's protobuf compiler auto-generates a "code" file based on this "def" file. Generally there is no point in decoding that file...
You generate the C# code file from definition.pb/proto by calling protoc.exe.
Again, you generate the code file in Java using the same definition.pb/proto and the same protoc.exe call (same command; the switches will be different).
Then you can communicate across both languages. You don't have to map/find equivalents. Hopefully I answered your question.
If you have further questions, post your .pb/proto file (the definition, not the data).
If you don't have the definition file, let us know; we may then be able to take further steps.
I have an application that takes a number of different images, makes new images from them, and then saves them for use in making a video. The images are all PNGs and the videos are several minutes long, so the program requires a lot of memory (one image for every 33.33 ms of video play time). When I process a single video, it all works fine. I can even process several videos and it all works fine. But, eventually, I get an OutOfMemoryError if I try to process 1 + n videos.
What is confusing me is how that error happens. Here is the part of the program where it occurs:
ComposeVideoController cvc = new ComposeVideoController();
boolean made = cvc.setXmlUrl(sourcePath, saveDir, fileId);
cvc = null;
To be more precise, the error happens in one of the frame construction classes which is referenced by the ComposeVideoController. ComposeVideoController is scoped to a single void method that runs recursively (if there are more videos to be made). I have gone through all the objects referenced by ComposeVideoController, which is the entry point to the library that builds the videos, and made sure they are all set to null too.
How can I get an OutOfMemoryError in ComposeVideoController when no individual video causes one and ComposeVideoController is out of scope (and set to null) after any given video is made?
The full recursion is shown below. I have one method that checks to see if there are new messages in queue (messages are sent by Socket) and if there are, it calls the method that processes the video. If not, the recursion ends:
private void processQueue() {
    if (makingVideo)
        return;
    MakeVideoObject mvo = queue.remove(0);
    makingVideo = true;
    String[] convertArr = mvo.getConvertArrayCommand();
    String sourcePath = convertArr[1];
    String fileId = convertArr[2] + ".mp4";
    String saveDir = convertArr[3] + System.getProperty("file.separator");
    try {
        ComposeVideoController cvc = new ComposeVideoController();
        boolean made = cvc.setXmlUrl(sourcePath, saveDir, fileId);
        cvc = null;
        if (made) {
            cleanDir(mvo);
        }
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}

/**
 * Moves all the assets off to a storage directory where we can be
 * able to recover the video assets if something goes wrong during
 * video creation.
 *
 * @param mvo
 */
private void cleanDir(MakeVideoObject mvo) {
    String[] convertArr = mvo.getConvertArrayCommand();
    String sourceDir = convertArr[1];
    String saveDir = convertArr[3] + System.getProperty("file.separator");
    String fileId = convertArr[2];
    sourceDir = sourceDir.substring(0, sourceDir.lastIndexOf(System.getProperty("file.separator")));
    try {
        File f = new File(sourceDir);
        File[] files = f.listFiles();
        for (File file : files) {
            if (file.getName().indexOf(fileId) != -1) {
                file.renameTo(new File(saveDir + file.getName()));
            }
        }
        makingVideo = false;
        mvo = null;
        if (queue.size() > 0) {
            processQueue();
        }
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
[Edited to show more of the program]
That's pretty much what happens if you execute nontrivial code recursively (either this or a classic stack overflow, whichever occurs first). Recursion is VERY resource-intensive; one should avoid it at all costs. Simply exchanging your recursion for an iterative algorithm will most likely make your error go away.
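A hypothetical iterative rewrite of processQueue, keeping the makingVideo guard but draining the queue in a loop (cleanDir would drop its call back into processQueue): no stack frames, and none of the local references they pin, accumulate across videos.

private void processQueue() {
    if (makingVideo)
        return;
    makingVideo = true;
    try {
        while (!queue.isEmpty()) {
            MakeVideoObject mvo = queue.remove(0);
            String[] convertArr = mvo.getConvertArrayCommand();
            String sourcePath = convertArr[1];
            String fileId = convertArr[2] + ".mp4";
            String saveDir = convertArr[3] + System.getProperty("file.separator");
            try {
                ComposeVideoController cvc = new ComposeVideoController();
                if (cvc.setXmlUrl(sourcePath, saveDir, fileId)) {
                    cleanDir(mvo); // cleanDir no longer calls processQueue
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    } finally {
        makingVideo = false;
    }
}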
I'm posting an answer since I've finally figured it out, on the off chance this helps anyone else trying to diagnose a memory leak in the future.
During my profiling, I could see that there were objects held in memory, but it made no sense to me, as they had been set to null. After going through one of the objects again, I noticed that I had declared it static. Because it was static, its members were effectively static too, one of which was a ConcurrentHashMap... So, those maps were having things added to them, and since the object was static, the object and its members would never be dereferenced. Another lesson for me as to why I almost never declare objects static.
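As a purely hypothetical illustration of that kind of leak (the class and field names here are made up, not from the actual library): a static field lives as long as its class, so everything reachable from it survives every GC, no matter how many referring instances are set to null.

import java.util.concurrent.ConcurrentHashMap;

class FrameBuilder {
    // Static: this map lives for the lifetime of the JVM. Entries added
    // while processing each video remain strongly reachable even after
    // every ComposeVideoController instance has been nulled out.
    static final ConcurrentHashMap<String, byte[]> frameCache = new ConcurrentHashMap<>();
}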
Use -Xmx to increase your maximum heap size. But also, I'd advise explicitly calling System.gc() right about where you're getting the OutOfMemoryError... If you are nulling out your other objects, this will help a lot.
I'm implementing an LRU cache for photos of users, using Commons Collections' LRUMap (which is basically a LinkedHashMap with small modifications). The findPhoto method can be called several hundred times within a few seconds.
public class CacheHandler {
    private static final int MAX_ENTRIES = 1000;

    private static Map<Long, Photo> photoCache = Collections.synchronizedMap(new LRUMap(MAX_ENTRIES));

    public static Map<Long, Photo> getPhotoCache() {
        return photoCache;
    }
}
Usage:
public Photo findPhoto(Long userId) {
    User user = userDAO.find(userId);
    if (user != null) {
        Map<Long, Photo> cache = CacheHandler.getPhotoCache();
        Photo photo = cache.get(userId);
        if (photo == null) {
            if (user.isFromAD()) {
                try {
                    photo = LDAPService.getInstance().getPhoto(user.getLogin());
                } catch (LDAPSearchException e) {
                    throw new EJBException(e);
                }
            } else {
                log.debug("Fetching photo from DB for external user: " + user.getLogin());
                UserFile file = userDAO.findUserFile(user.getPhotoId());
                if (file != null) {
                    photo = new Photo(file.getFilename(), "image/png", file.getFileData());
                }
            }
            cache.put(userId, photo);
        } else {
            log.debug("Fetching photo from cache, user: " + user.getLogin());
        }
        return photo;
    } else {
        return null;
    }
}
As you can see I'm not using synchronization blocks. I'm assuming that the worst case scenario here is a race condition that causes two threads to run cache.put(userId, photo) for the same userId. But the data will be the same for two threads, so that is not an issue.
Is my reasoning here correct? If not, is there a way to use a synchronization block without getting a large performance hit? Having only 1 thread accessing the map at a time feels like overkill.
Assylias is right that what you've got will work fine.
However, if you want to avoid fetching images more than once, that is also possible, with a bit more work. The insight is that if a thread comes along, makes a cache miss, and starts loading an image, then if a second thread comes along wanting the same image before the first thread has finished loading it, then it should wait for the first thread, rather than going and loading it itself.
This is fairly easy to coordinate using some of Java's simpler concurrency classes.
Firstly, let me refactor your example to pull out the interesting bit. Here's what you wrote:
public Photo findPhoto(User user) {
    Map<Long, Photo> cache = CacheHandler.getPhotoCache();
    Photo photo = cache.get(user.getId());
    if (photo == null) {
        photo = loadPhoto(user);
        cache.put(user.getId(), photo);
    }
    return photo;
}
Here, loadPhoto is a method which does the actual nitty-gritty of loading a photo, which isn't relevant here. I assume that the validation of the user is done in another method which calls this one. Other than that, this is your code.
What we do instead is this:
public Photo findPhoto(final User user) throws InterruptedException, ExecutionException {
    Map<Long, Future<Photo>> cache = CacheHandler.getPhotoCache();
    Future<Photo> photo;
    FutureTask<Photo> task;
    synchronized (cache) {
        photo = cache.get(user.getId());
        if (photo == null) {
            task = new FutureTask<Photo>(new Callable<Photo>() {
                @Override
                public Photo call() throws Exception {
                    return loadPhoto(user);
                }
            });
            photo = task;
            cache.put(user.getId(), photo);
        } else {
            task = null;
        }
    }
    if (task != null) task.run();
    return photo.get();
}
Note that you need to change the type of CacheHandler.photoCache to accommodate the wrapping FutureTasks. And since this code does explicit locking, you can remove the synchronizedMap from it. You could also use a ConcurrentMap for the cache, which would allow the use of putIfAbsent, a more concurrent alternative to the lock/get/check for null/put/unlock sequence.
Hopefully, what is happening here is fairly obvious. The basic pattern of getting something from the cache, checking to see if what you got was null, and if so putting something back in is still there. But instead of putting in a Photo, you put in a Future, which is essentially a placeholder for a Photo which may not (or may) be there right at that moment, but which will become available later. The get method on Future gets the thing that a place is being held for, blocking until it arrives if necessary.
This code uses FutureTask as an implementation of Future; this takes a Callable capable of producing a Photo as a constructor argument, and calls it when its run method is called. The call to run is guarded with a test that essentially recapitulates the if (photo == null) test from earlier, but outside the synchronized block (because as you realised, you really don't want to be loading photos while holding the cache lock).
This is a pattern I've seen or needed a few times. It's a shame it's not built into the standard library somewhere.
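For completeness, a minimal sketch of the ConcurrentMap/putIfAbsent variant mentioned above, ignoring LRU eviction for brevity. putIfAbsent returns the previously mapped value (or null if our task won the race), so only one thread runs the load; the others block in get():

import java.util.concurrent.*;

private static final ConcurrentMap<Long, Future<Photo>> cache = new ConcurrentHashMap<>();

public Photo findPhoto(final User user) throws InterruptedException, ExecutionException {
    Future<Photo> photo = cache.get(user.getId());
    if (photo == null) {
        FutureTask<Photo> task = new FutureTask<Photo>(new Callable<Photo>() {
            @Override
            public Photo call() throws Exception {
                return loadPhoto(user);
            }
        });
        photo = cache.putIfAbsent(user.getId(), task);
        if (photo == null) {
            photo = task; // our task won the race; run it on this thread
            task.run();
        }
    }
    return photo.get();
}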
Yes, you are right: if the photo creation is idempotent (always returns the same photo), the worst thing that can happen is that you will fetch it more than once and put it into the map more than once.
I'm writing GC-friendly code that reads a series of byte[] messages and returns them to the user. Internally I reuse the same ByteBuffer, which means I'll repeatedly return the same byte[] instance most of the time.
I'm considering writing cautionary javadoc and exposing this to the user as a Iterator<byte[]>. AFAIK it won't violate the Iterator contract, but the user certainly could be surprised if they do Lists.newArrayList(myIterator) and get back a List populated with the same byte[] in each position!
The question: is it bad practice for a class that may mutate and return the same object to implement the Iterator interface?
If so, what is the best alternative? "Don't mutate/reuse your objects" is an easy answer. But it doesn't address the cases when reuse is very desirable.
If not, how do you justify violating the principle of least astonishment?
Two minor notes:
I'm using Guava's AbstractIterator so remove() isn't really of concern.
In my use case the user is me and the visibility of this class will be limited, but I've tried to ask this generally enough to apply more broadly.
Update: I'm accepting Louis' answer because it has 3x more votes than Keith's, but note that in my use case I'm planning to take the code that I left in a comment on Keith's answer to production.
EnumMap did essentially exactly this in its entrySet() iterator, which causes confusing, crazy, depressing bugs to this day.
If I were you, I just wouldn't use an Iterator -- I'd write a different API (possibly quite dissimilar from Iterator, even) and implement that. For example, you might write a new API that takes as input the ByteBuffer to write the message into, so users of the API could control whether or not the buffer gets reused. That seems reasonably intuitive (the user can write code that obviously and cleanly reuses the ByteBuffer), without creating unnecessarily cluttered code.
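A rough sketch of what such an API might look like, with hypothetical names; the point is that the caller owns the buffer, so reuse becomes an explicit, visible choice rather than a surprise:

import java.io.IOException;
import java.nio.ByteBuffer;

interface MessageReader {
    // Clears the buffer, reads the next message into it, and flips it
    // for reading; returns false when no messages remain.
    boolean readNext(ByteBuffer buffer) throws IOException;
}

// Usage: the same buffer is reused across iterations, by the caller's choice.
// ByteBuffer buf = ByteBuffer.allocate(maxMessageSize);
// while (reader.readNext(buf)) {
//     process(buf); // copy out of buf here if the data must outlive this pass
// }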
I would define an intermediate object which you can invalidate. So your function would return an Iterator<ByteArray>, and ByteArray is something like this:
class ByteArray {
    private byte[] data;

    ByteArray(byte[] d) { data = d; }

    byte[] getData() {
        if (data == null) throw new BadUseOfIteratorException();
        return data;
    }

    void invalidate() { data = null; }
}
Then your iterator can invalidate the previously returned ByteArray so that any future access (via getData, or any other accessor you provide) will fail. If someone then does something like Lists.newArrayList(myIterator), they will at least get an error (when the first invalid ByteArray is accessed) instead of silently getting the wrong data.
Of course, this won't catch all possible bad uses, but probably the common ones. If you're happy with never returning the raw byte[] and providing accessors like byte get(int idx) instead, then it should catch all cases.
You will have to allocate a new ByteArray for each iterator return, but hopefully that's a lot less expensive than copying your byte[] for each iterator return.
Just like Keith Randall, I'd also create an Iterator<ByteArray>, but one working quite differently (the annotations below come from Lombok):
@RequiredArgsConstructor
public class ByteArray {
    @Getter private final byte[] data;
    private final ByteArrayIterable source;

    void allowReuse() {
        source.allowReuse();
    }
}

public class ByteArrayIterable implements Iterable<ByteArray> {
    private boolean allowReuse;

    public void allowReuse() {
        allowReuse = true;
    }

    public Iterator<ByteArray> iterator() {
        return new AbstractIterator<ByteArray>() {
            private ByteArray nextElement;

            public ByteArray computeNext() {
                if (noMoreElements()) return endOfData();
                if (!allowReuse) nextElement =
                        new ByteArray(new byte[length], ByteArrayIterable.this);
                allowReuse = false;
                fillWithNewData(nextElement.getData());
                return nextElement;
            }
        };
    }
}
Now in calls like Lists.newArrayList(myIterator), a new byte array always gets allocated, so everything works. In loops like
for (ByteArray a : myByteArrayIterable) {
    a.allowReuse();
    process(a.getData());
}
the buffer gets reused. No harm can result unless you call allowReuse() by mistake. If you forget to call it, you get worse performance but correct behavior.
Now I see it could work without ByteArray; the important thing is that myByteArrayIterable.allowReuse() gets called, which could be done directly.
Say, for example, I have a complex dynamically allocated structure (such as a binary tree) that needs to be written to a file made up of different sections. I would like to first write the size of the structure as a dword followed by the structure itself, however the size of the structure is only known after I have written the structure to the file. It is difficult, in this case, to pre-determine the size of the structure in memory.
Is it best to write the size as 0, then write the structure, then seek back and overwrite the size with the correct value? I don't like that idea, though. Is there a better/proper way to do it?
Just an idea: write the data to a ByteArrayOutputStream; after that, you should be able to call size() to get the actual length in bytes and toByteArray() to get the byte buffer, which can then be written to a file.
Code example
public static void main(String[] args) throws Exception {
    ArrayList objects = new ArrayList();
    objects.add("Hello World");
    objects.add(new Double(42.0));
    System.out.println(sizeof(objects));
}

public static int sizeof(Serializable object) {
    ObjectOutputStream out = null;
    ByteArrayOutputStream baos = null;
    try {
        baos = new ByteArrayOutputStream();
        out = new ObjectOutputStream(baos);
        out.writeObject(object);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (out != null) {
            try {
                out.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
    return baos != null ? baos.size() : -1;
}
This just demonstrates a sizeof emulator (which differs from the C implementation because it calculates the size of a serialized object; an implementation for raw bytes would be slightly different).
Have you looked at RandomAccessFile yet?
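A minimal sketch of the seek-back approach from the question using RandomAccessFile; writeStructure is a hypothetical method that serializes the tree to the file:

import java.io.IOException;
import java.io.RandomAccessFile;

static void writeWithSizePrefix(String path) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
        long sizePos = raf.getFilePointer();
        raf.writeInt(0);                   // placeholder for the size dword
        long start = raf.getFilePointer();
        writeStructure(raf);               // hypothetical: writes the structure
        long end = raf.getFilePointer();
        raf.seek(sizePos);
        raf.writeInt((int) (end - start)); // patch in the real size
        raf.seek(end);                     // continue with any further sections
    }
}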
Why do you need to write the size at all? Won't the file be the size of the structure after you have written it?
If you have variable components like arrays or lists, you can write the sizes of those as you write the data. However the total length is redundant and not very useful.
If you really have to, you can write the data to a ByteArrayOutputStream first to get the length. (But I seriously doubt it)
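A minimal sketch of that buffering approach, with writeStructure again standing in as a hypothetical serializer: the structure is built in memory once, its length is then known, and both are written out in order.

import java.io.*;

static void writeBuffered(String path) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    writeStructure(new DataOutputStream(buffer)); // hypothetical serializer
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
        out.writeInt(buffer.size()); // the size dword first
        buffer.writeTo(out);         // then the structure itself
    }
}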
Please refer to the URL http://www.javapractices.com/topic/TopicAction.do?Id=83 for calculating the size of an object. This utility seems worthwhile for your need.
To measure the size of a particular object containing data, measure JVM memory use before and after building the object.