I need to invoke Tesseract OCR (it's an open source C++ library that does optical character recognition) from a Java application server. Right now it's easy enough to run the executable using Runtime.exec(). The basic logic would be:
Save the image that is currently held in memory to a file (a .tif).
Pass the image file name to the tesseract command line program.
Read in the output text file from Java using FileReader.
How much improvement in terms of performance am I likely to get by writing a JNI wrapper for Tesseract? Unfortunately there is not an open source JNI wrapper that works in Linux. I would have to do it myself and am wondering about whether the benefit is worth the development cost.
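The three steps above can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names (TesseractCli, buildCommand, recognize) are made up, the 60 second timeout is arbitrary, tesseract is assumed to be on the PATH and to follow its usual "tesseract <image> <outputbase>" convention of writing <outputbase>.txt, and Java 11 or later is assumed. ProcessBuilder is used in place of Runtime.exec() because it makes handling the child's output simpler.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs the tesseract command-line program on an in-memory TIFF. */
final class TesseractCli {

    /** Builds the command line "tesseract <image> <outputBase>" (assumes tesseract is on the PATH). */
    static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    /** Writes the image to a temp file, runs tesseract, and returns the recognized text. */
    static String recognize(byte[] tiffBytes) throws IOException, InterruptedException {
        // Step 1: save the in-memory image to a .tif file.
        Path image = Files.createTempFile("ocr-", ".tif");
        Path outputBase = Path.of(image.toString().replaceFirst("\\.tif$", ""));
        Path textFile = Path.of(outputBase + ".txt"); // tesseract appends ".txt" itself
        try {
            Files.write(image, tiffBytes);

            // Step 2: run the external program; discard its console output so it cannot block.
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(60, TimeUnit.SECONDS)) { // arbitrary timeout
                process.destroyForcibly();
                throw new IOException("tesseract timed out");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract failed with exit code " + process.exitValue());
            }

            // Step 3: read the text file that tesseract produced (assumed to be UTF-8).
            return Files.readString(textFile, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(image);
            Files.deleteIfExists(textFile);
        }
    }
}
```

Every call pays for one process start plus two trips through the file system, which is the overhead a JNI wrapper would be trying to remove.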
It's hard to say whether it would be worth it. If you assume that, when run in-process via JNI, the OCR code can access the image data directly without it first being written to a file, then JNI would certainly eliminate the disk I/O involved.
I'd recommend going with the simpler approach and only undertaking the JNI option if performance is not acceptable. At least then you'll be able to do some benchmarking and estimate the performance gains you might be able to realize.
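For the benchmarking step, a very rough timing helper might look like the sketch below. The names (RoughTimer, averageNanos) are hypothetical, and this kind of hand-rolled timing is only good for coarse comparisons such as "process launch versus in-process call"; a dedicated harness such as JMH would give more trustworthy numbers.

```java
/** Hypothetical helper for coarse wall-clock timing of a repeated task. */
final class RoughTimer {

    /** Runs the task a few times untimed (warm-up), then returns the average time per run in nanoseconds. */
    static long averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        // Warm-up runs give the JIT compiler a chance to optimize the code path.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / measuredRuns;
    }
}
```

Timing the command-line approach this way on representative images gives a baseline to compare against before any JNI work is started.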
If you do pursue your own wrapper, I recommend you check out JNA. It will allow you to call most "native" libraries writing only Java code, and will give you more help than does raw JNI to do it safely. JNA is available for most platforms.
I agree with tweakt. Do not use JNI unless there is a performance reason to do so. JNI calls can also put your application's stability at risk: a memory leak or crash in your JNI layer, or in the OCR library itself, affects the whole JVM process. That cannot happen when you go through the command line interface (all memory is released when the program exits, and any abnormal termination can be detected in the calling code).
Suppose I want to search for some text in a file. I want to know when we should use system utilities/programs like grep and when we should use Java APIs, such as reading a line and then searching the text in that line, or using Java's Scanner class.
I want to understand the trade-offs between the two approaches. For example, if we use grep, will there be communication overhead between the JVM and the grep process? Is creating a new OS process for grep an overhead?
Does grep perform better than a normal Java file search?
Please help...
Yes, there will be an overhead. Starting an external process and communicating with it is costly. And moreover, many systems don't have a grep command. If you want to make your Java code portable, don't rely on OS-specific commands.
Another problem is that OS commands will be able to search (for example) in files, but not in your in-memory data structures.
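As a point of comparison, a pure-Java search needs only a few lines. This is a minimal sketch, assuming Java 8 or later and UTF-8 files; the class and method names (JavaGrep, matchingLines) are made up for illustration, and it does plain substring matching rather than grep's regular expressions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: substring search without an external process. */
final class JavaGrep {

    /** Returns every line of the file that contains the given text. */
    static List<String> matchingLines(Path file, String needle) throws IOException {
        // Files.lines reads lazily, so large files are not loaded into memory at once.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(needle))
                        .collect(Collectors.toList());
        }
    }

    /** Same search over data already held in memory, which an external command cannot see. */
    static List<String> matchingLines(List<String> lines, String needle) {
        return lines.stream()
                    .filter(line -> line.contains(needle))
                    .collect(Collectors.toList());
    }
}
```

The second overload illustrates the point about in-memory data structures: no file has to be written just so that a tool can read it.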
You're basically trading off system independence for the perceived gain of the tool; in some cases this can't be avoided.
Not every system will have the tools you want installed, in the locations you expect, or in the version you need.
Even if you can deploy the tools with your application, you will need to provide an implementation for each of your targeted platforms.
Sure, it's easy to say "it will never be run on X", but "never" can come around real quick ;)
There is also the added overhead of executing and managing the I/O of the external applications; while not difficult, it's much more complex than a well written Java API.
As I said, sometimes, you simply don't have a choice (I have some media inspection tools that I use on Windows and Mac that I'm not about to try and implement in Java, not because it can't be, but because it's complex and time consuming and somebody has already done it (with a native program)).
You need to weigh the benefits of the external command against the issues of using it. You should also investigate whether an API has already been developed that might solve the problem at hand.
IMHO
I'm new to Java, and was told to use the Java Native Interface to run some code I wrote in C.
Now, this might be a stupid question, but what's the point of the JNI ? Can't I simply execute my process from a Java UI program and get its stdout to parse ?
Also, I've read that the use of JNI might cause security issues. Do these issues directly depend on the quality of the invoked code ? Or is this something deeper ?
Thanks.
what's the point of the JNI ?
It enables you to mix C and Java code within the same process.
Can't I simply execute my process from a Java UI program and get its stdout to parse ?
A lot of things that can be achieved by using JNI can also be achieved by using inter-process communication (IPC). However, you'd have to ship all the input data to the other process, and then ship all the results back. This can be pretty expensive, which makes IPC impractical for many situations where JNI can be used (e.g. wrapping existing C libraries).
Also, I've read that the use of JNI might cause security issues. Do these issues directly depend on the quality of the invoked code ? Or is this something deeper ?
The point here is that the JVM does a lot of work to ensure that whatever Java code is thrown at it, things like buffer overruns, stack smashing attacks etc can't occur. For example, it performs bounds checking on all array accesses (which C doesn't).
On the other hand, JNI code is a black box to the JVM. If there's a problem with the C code (e.g. a buffer overrun), all bets are off.
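The bounds checking mentioned above is easy to observe. In this small sketch (the class and method names are made up), a write one element past the end of an array is rejected by the JVM with an exception, where the equivalent C code could silently corrupt neighbouring memory.

```java
/** Hypothetical demo of the JVM's array bounds checking. */
final class BoundsCheckDemo {

    /** Attempts buffer[index] = 42 and reports whether the JVM rejected the write. */
    static boolean writeRejected(int[] buffer, int index) {
        try {
            buffer[index] = 42;
            return false; // index was in range, the write went through
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;  // index was out of range, nothing was overwritten
        }
    }
}
```

Calling writeRejected(new int[4], 4) returns true, while writeRejected(new int[4], 3) returns false. No comparable check protects memory touched by native code reached through JNI.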
Can't I simply execute my process from a Java UI program and get its stdout to parse ?
Do you think it's always appropriate to start a new process every time you want to execute any native code? Do you really want to be transferring potentially large amounts of data between processes? (Imagine a native image transformation.)
Also, I've read that the use of JNI might cause security issues. Do these issues directly depend on the quality of the invoked code?
Yes. Basically native code has less security sandboxing than Java running in a JVM. If the code has security bugs (e.g. buffer overflows) then clearly that will affect the security of your overall app.
I should say that it's relatively rare for Java developers to need to worry about JNI - I've certainly only touched it a couple of times in my career. You may also want to look at SWIG if the need arises.
Can't I simply execute my process from a Java UI program and get its stdout to parse ?
That would depend on what you are calling.
Note that JNI cannot call programs, only library code.
In addition to that, spawning new processes is relatively expensive and managing multiple processes is complicated.
I'm facing a project with audio streaming, as a client and a server. Would Java be a good choice for the server app?
I've read in other questions that because of performance C++ is the best choice for this kind of app.
Whichever of C++ or Java you are more comfortable with, I would use that. You can write a low pause server in either language.
A streaming server is mostly about passing lots of data from A to B i.e. the I/O matters. Unless you plan to compress the stream on the fly, the CPU performance is unlikely to be important.
Even if you are doing on the fly compression and Java isn't fast enough for that, you can call a library (preferably one already written/tested) to do this via JNI and still write most of the server in Java.
shrug It's not a bad choice. While audio streaming does have a performance component, the algorithms/optimizations you make are going to have a much larger effect than the language you choose.
Not to mention the famous Knuth quote "Premature optimization is the root of all evil". Write in whatever you are most comfortable with and check if it's a problem later.
The biggest issue with Java performance, I think, is garbage collection. Without careful consideration to what you're doing, it's easy in Java to write code that needs to pause every so often to clean up. C++ doesn't have that problem. On the other hand, without consideration of what you're doing, it's easy to write C++ code that leaks heap memory (when you forget to delete something from the heap). This is really bad for a long-running process like a server. It's possible to leak memory in Java, but it's related to keeping references around too long, not to anything built into the language.
Although C++ tends to be faster, with modern just-in-time compilers for Java, the performance difference tends to be overstated. Overall Java is probably just as fine as C++ for a streaming audio server. If you find that there's a bottleneck in some compute-intensive section, you can always drop down to C++ using Java Native Interface. But that should only be after identifying a problem with profiling.
If you did want to use java, here would be a great place to find some uses and files for using media in java...
I am curious about what automatic methods may be used to determine if a Java app running on a Windows PC is malware. (I don't really even know what exploits are available to such an app. Is there someplace I can learn about the risks?) If I have the source code, are there specific packages or classes that could be used more harmfully than others? Perhaps they could suggest malware?
Update: Thanks for the replies. I was interested in knowing if this would be possible, and it basically sounds totally infeasible. Good to know.
If it's not even possible to automatically determine whether a program terminates, I don't think you'll get much leverage in automatically determining whether an app does "naughty stuff".
Part of the problem of course is defining what constitutes malware, but the majority is simply that deducing proofs about the behaviour of other programs is surprisingly difficult/impossible. You may have some luck spotting particular patterns, but on the whole you can't be confident (and I suspect it's provably impossible) that you've caught all possible attack vectors.
And in the general sphere, catching 95% of vectors isn't really worthwhile when the attackers simply concentrate on the remaining 5%.
Well, there's always the fundamental philosophical question: what is malware? It's code that was intended to do damage, or at least code that doesn't do what it claims to. How do you plan to judge intent based on the libraries it uses?
Having said that, if you at least roughly know what the program is supposed to do, you can indeed find suspicious packages, things the program wouldn't normally need to access. Like network connections when the program is meant to run as a desktop app. But then the network connection could just be part of an autoupdate feature. (Is autoupdate itself malware? Sometimes it feels like it is.)
Another indicator is if a program that ostensibly doesn't need any special privileges refuses to run in a sandbox. And the biggest threat is if it tries to load a native library when it shouldn't need one.
But all these only make sense if you know what the code is supposed to do. An antivirus package might use very similar techniques to viruses, the only difference is what's on the label.
Here is a general outline for how you can bound the possible actions your Java application can take. Basically you are testing to see if the Java application is 'inert' (can't take harmful actions) and thus is probably not malware.
This won't necessarily tell you whether it is malware or not, as others have pointed out. The app could still do annoying things like pop up windows. Perhaps the best indication is to see if the application is digitally signed by an author you trust; if not -- be afraid.
You can disassemble the class files to determine which Java APIs the application uses; you are looking for points where the java app uses the OS. Since java uses a virtual machine, there are well defined points where a java application could take potentially harmful actions -- these are the 'gateways' to various OS calls (for example opening a socket or reading a file).
It's difficult to enumerate all the APIs; different functions which execute the same OS action should require the same Permission, but Java's docs don't provide an exhaustive list.
Does the Java app use any native libraries? If so, it's a big red flag.
The JVM does not offer the ability to run arbitrary code, or use native system APIs; in particular it does not offer the ability to modify the registry (a typical action of PC malware). The only way a Java application can do this is via native libraries. Typically there is no need for a normal application written in Java to use native code (unless it needs to use devices).
Check for System.loadLibrary() or System.load() or Runtime.loadLibrary() or Runtime.load(). This is how the VM loads native libraries.
Does it use the network or file system?
Look for use of java.io, java.net.
Does it make system calls (via Runtime.exec())
You can check for the use of java.lang.Runtime.exec() or java.lang.ProcessBuilder.start().
Does it try to control the keyboard / mouse?
You could also run the application in a restricted policy JVM (the instructions/tools for doing this are not as simple as they should be) and see what fails (see Oracle's security tutorial) -- note that disassembly is the only way to be sure, just because the app doesn't do anything harmful once, doesn't mean it won't in the future.
This definitely is not easy, and I was surprised to find how many places one needs to look at (for example several java functions load native libraries, not just one).
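A crude first pass over the checklist above can be automated by scanning compiled class files for the names of those 'gateway' APIs, which are stored as plain text in the class file's constant pool. The sketch below is only a heuristic under assumptions: the class name, the method name and the marker list are made up and far from exhaustive, java/awt/Robot is my own guess at a keyboard/mouse marker, and code that uses reflection or builds names at run time will slip past it. A real disassembler such as javap gives a much more precise picture.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical heuristic: looks for names of sensitive APIs inside raw class file bytes. */
final class ClassFileScanner {

    // Illustrative, non-exhaustive markers in the internal (slash-separated) name form.
    private static final List<String> MARKERS = List.of(
            "java/lang/Runtime",        // exec(), load(), loadLibrary()
            "java/lang/ProcessBuilder", // start()
            "loadLibrary",              // System.loadLibrary / Runtime.loadLibrary
            "java/net/Socket",          // network access
            "java/awt/Robot");          // synthetic keyboard and mouse input (assumed marker)

    /** Returns the markers that occur anywhere in the given class file bytes. */
    static List<String> findMarkers(byte[] classFileBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so ASCII names survive decoding unchanged.
        String raw = new String(classFileBytes, StandardCharsets.ISO_8859_1);
        List<String> found = new ArrayList<>();
        for (String marker : MARKERS) {
            if (raw.contains(marker)) {
                found.add(marker);
            }
        }
        return found;
    }
}
```

A hit only says that the class refers to the API, not that it uses it harmfully, so the results are a list of places to look rather than a verdict.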
Some modern network cards support Direct Memory Access for improved performance. How can I utilize this feature from Java?
Does the JVM provide this automatically, or do I need to do an allocateDirect on the ByteBuffers that I am using to talk to that NIC?
Does anyone have documentation that discusses this?
It is the operating system's task to use the DMA feature of the network card. The JVM does not really care how the OS does it, and simply uses the operating system's functions for talking to "network interfaces".
You cannot do this from inside Java in the typical desktop/server JVMs, as this is operating system territory which requires you to reach out into C code. Have a look at JNI or JNA to see how to do this. Please note that this may make your application brittle if you do not get this exactly right.
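On the allocateDirect part of the question, the sketch below shows the difference between a heap buffer and a direct buffer (the class and method names are made up). A direct buffer lives outside the Java heap, and the JVM makes a best effort to hand it to the operating system's I/O calls without an intermediate copy. It does not by itself give Java any control over the NIC's DMA engine; that stays with the OS and the driver.

```java
import java.nio.ByteBuffer;

/** Hypothetical demo contrasting heap and direct byte buffers. */
final class DirectBufferDemo {

    /** Allocates a buffer outside the Java heap, suitable for passing to NIO channels. */
    static ByteBuffer newDirectBuffer(int capacity) {
        return ByteBuffer.allocateDirect(capacity);
    }

    /** Allocates an ordinary buffer backed by a byte[] on the Java heap. */
    static ByteBuffer newHeapBuffer(int capacity) {
        return ByteBuffer.allocate(capacity);
    }
}
```

Here newDirectBuffer(4096).isDirect() is true and newHeapBuffer(4096).isDirect() is false. Direct buffers are typically more expensive to allocate, so they are usually created once and reused for long-lived I/O.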
Yeah - ankon's answer is right. Java operates in a sandbox - a virtual machine (hence the "VM" in JVM; Sun actually built ONE physical version -- it's on display somewhere).
Java was never designed (intentionally) to reach outside the sandbox, unlike ActiveX, which can go just about anywhere on a PC.
Just think of all the bad things ActiveX has done over the years via a browser. You wouldn't want that to happen with Java, would you?
Although...
you might be able to instantiate an object in Java that does have access to the hardware (like one of those ActiveX controls, or some DLL, for example - which you'd have to write, too).
The problem I see is the throughput. With 100 Mbit or 1000 Mbit cards, would a JVM (remember, this is a VM running on an OS, so you're a couple of layers removed from the hardware) have the speed to handle what's coming in under load? Would you want a Java program holding up data in your NIC while it tinkered with it (think of the impact on the rest of the system)?
At this point, you're probably better off writing the hard-working guts of your solution in C. And, if you still need Java to play with that data, put it in a place where Java can get to it.
If you're not getting the network throughput you need in java, then you're going to need to write a C wrapper in order to access it.
Have you benchmarked your code to find where your performance issues really are? If you let us know that we can likely help you out without resorting to JNI.