Integrating an external program - Java

So I have been tasked with integrating a program called "lightSIDE" into a Hadoop job, and I'm having some trouble figuring out how to go about this.
So essentially, rather than a single JAR, lightSIDE comes as an entire directory, including XML files that are crucial to running it.
Up until now, the way the data scientists on my team have been using this program is by running a Python script that launches the executable, but this seems extremely inefficient, as it spins up a new JVM every time it gets called. That being said, I have no idea how else to handle this.

If you are writing your own MapReduce jobs then it is possible to include all the JAR files as libraries and the XML files as resources.
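For example, a rough sketch of a job driver that ships the lightSIDE JARs and XML files along with the job (all file names and HDFS paths here are made up, adjust them to your layout):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class LightSideJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "lightside-scoring");
        job.setJarByClass(LightSideJobDriver.class);

        // Put the lightSIDE JARs on the task classpath (paths are on HDFS).
        job.addFileToClassPath(new Path("/libs/lightside/lightside-core.jar"));
        job.addFileToClassPath(new Path("/libs/lightside/lightside-deps.jar"));

        // Ship the XML configuration files to every task's working directory
        // (the "#name" fragment controls the local symlink name).
        job.addCacheFile(new URI("/libs/lightside/config/feature-config.xml#feature-config.xml"));
        job.addCacheFile(new URI("/libs/lightside/config/model-config.xml#model-config.xml"));

        // ... set mapper/reducer classes and input/output paths, then:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The XML files should then show up as local files in each task's working directory, so lightSIDE can be initialized once per task (in setup()) rather than once per record.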

I'm one of the maintainers for the LightSide Researcher's Workbench. LightSide also includes a tiny PredictionServer class to handle predictions on new instances over HTTP - you can see it here on BitBucket.
If you want to train new models instead, you could modify this server to do what you want, drawing clues from the side.recipe.Chef class.

How to retrieve real-time data from an existing Java application without access to source code?

Update: Jun 10, 2022
I have successfully been able to create a demo application with AspectJ integration that could extract variables from the demo application. It was quite a hassle, since there are some issues with the Eclipse AJDT integration.
I was able to use the command line and ajc (the AspectJ compiler) to achieve binary weaving into my demo application.
Original Question:
I am trying to retrieve real-time data from a running Java application and push it into an API I have on a server.
I have no access to the source code of the running application; I only have the Jar file. I have tried decompilation into .java files; however, due to the scale of the app, I was not able to fix all of the missing access$000 function calls.
Is there a certain approach I should use when retrieving real-time data from an existing Java application? Has that been done before? Am I missing something that I am not aware of?
Any help is appreciated.
This is a big challenge, obviously. If you can glean enough understanding of how the program works from decompiling it and reading log files to target some methods where you suspect there's data of interest to your API, then I would read up on Aspect-Oriented Programming (AOP) and use those tools.
With AOP you can modify the classes in the JAR file at runtime, as they are loaded by the JVM, and instrument them.
For example: You can gather data from:
fields within the class that owns a method
parameters passed to a method
value returned from a method
Once you gather the data, you can also insert calls to your API.
Here's a place to start - https://www.baeldung.com/aspectj .
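For illustration, here is a minimal sketch of an AspectJ aspect that captures a method's return value and forwards it to an external API. The target package, pointcut, and endpoint URL are hypothetical; you would point them at whatever methods you identified from the decompiled code:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.AfterReturning;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class DataCaptureAspect {

    // Runs after any public method in the (hypothetical) com.targetapp.orders
    // package returns normally, giving us access to the return value.
    @AfterReturning(
        pointcut = "execution(public * com.targetapp.orders..*.*(..))",
        returning = "result")
    public void captureReturnValue(JoinPoint joinPoint, Object result) {
        try {
            String payload = joinPoint.getSignature().toShortString() + "=" + result;

            // Push the captured value to your own API endpoint (URL is made up).
            URL url = new URL("http://my-api.example.com/capture");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            conn.getResponseCode(); // force the request to be sent
            conn.disconnect();
        } catch (Exception e) {
            // Never let instrumentation break the target application.
            e.printStackTrace();
        }
    }
}
```

Since you only have the JAR, you can weave this in without source, either ahead of time with ajc binary weaving (roughly ajc -inpath app.jar -aspectpath aspects.jar -outjar app-woven.jar) or at class-load time with the AspectJ load-time weaving agent.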

Apache Spark - run external exe or jar file in parallel

I have an .exe file (I don't have the source files, so I won't be able to edit the program) that takes as a parameter the path to the file it should process and produces results at the end. For example, on the console I run the program as follows: program.exe -file file_to_process [other_parameters]. I also have an executable JAR file that takes two parameters, file_to_process and a second file, plus [other_parameters]. In both cases I would like to split the input file into smaller parts and run the programs in parallel. Is there an efficient way to do this with the Apache Spark Java framework? I'm new to parallel computation, and I have read about RDDs and the pipe operator, but I don't know whether they would fit my case, because I only have a path to the file.
I will be very grateful for some help or tips.
I have run into similar issues recently, and I have working code with Spark 2.1.0. The basic idea is: put your exe, along with its dependencies such as DLLs, into HDFS or onto your local filesystem, and use addFile to register them with the driver, which will also copy them to the worker executors. Then you can load your input file as an RDD and use the mapPartitionsWithIndex function to save each partition to a local file and run the exe against that partition with Process (use SparkFiles.get to find the exe's path on the worker executor).
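To make the idea concrete, here is a rough Java sketch of that approach; the exe name, HDFS paths, and argument handling are placeholders:

```java
import java.io.File;
import java.io.PrintWriter;
import java.util.Collections;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExternalExeOnPartitions {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("external-exe-on-partitions");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Ship the executable (and any DLLs) to every executor.
        sc.addFile("hdfs:///tools/program.exe");

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt", 8);

        JavaRDD<String> results = lines.mapPartitionsWithIndex((index, it) -> {
            // 1. Dump this partition to a local temp file on the executor.
            File part = File.createTempFile("part-" + index + "-", ".txt");
            try (PrintWriter w = new PrintWriter(part, "UTF-8")) {
                while (it.hasNext()) {
                    w.println(it.next());
                }
            }

            // 2. Run the external program against the local file.
            String exe = SparkFiles.get("program.exe");
            Process p = new ProcessBuilder(exe, "-file", part.getAbsolutePath())
                    .inheritIO()
                    .start();
            int exitCode = p.waitFor();

            // 3. Report the outcome (real code would collect the program's output files).
            return Collections.singletonList("partition " + index + " exit=" + exitCode).iterator();
        }, true);

        results.saveAsTextFile("hdfs:///data/output");
        sc.stop();
    }
}
```

The fragile parts in practice are local disk space on the executors and deciding how to pick up whatever output files the external program writes.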
Hope that helps.
I think the general answer is "no". Spark is a framework, and in general it manages very specific mechanisms for cluster configuration, shuffling its own data, reading large inputs (typically from HDFS), monitoring task completion, retrying, and performing efficient computation. It is not well suited to a case where you have a program you can't touch and that expects a file from the local filesystem.
I guess you could put your inputs on HDFS; then, since Spark accepts arbitrary Java/Scala code, you could use whatever language facilities you have to dump each partition to a local file, launch a process, and then build some logic to monitor for completion (maybe based on the content of the output). The mapPartitions() Spark method would be the one best suited for this.
That said, I would not recommend it. It will be ugly, complex, require you to mess with permissions on the nodes and things like that and would not take good advantage of Spark's strengths.
Spark is well suited to your problem though, especially if each line of your file can be processed independently. I would look to see if there is a way to get the program's code, find a library that does the same thing, or see if the algorithm is trivial enough to re-implement.
Probably not the answer you were looking for though :-(

Functional/regression testing for Java Applications working with files

I'm trying to find the best way to create automated functional/acceptance/regression tests for some Java applications. All the applications work in this way:
They read a File from a given folder
They write a new file in another format with the content of the input file.
They send to database some of the information of processed files.
They wait until a new file is left in the input folder.
This is a cyclic application, it never stops.
New files/formats are added continuously, and several of our libraries are shared by all the formats. Manual testing is becoming more and more costly with each new format. All the files are plain text, but the data is laid out differently in each format.
We need a way/tool that could help us automate the functional/acceptance/regression tests (especially the QA tests).
The question is: What tool/way of testing can be used for this?
I was thinking of something that could drop files in the input folder and compare what the application creates in the output folder with an expected file. I don't know if this can be done easily with an existing tool or if we have to build all of this ourselves.
I would use a generic functional test automation framework and use a set of libraries to read/parse/compare files. I am familiar with Robot Framework and there are some Python Libraries to read/compare files (some embedded in Robot itself, some elsewhere). That is very convenient and quite easy to use for QA Tests. Check out the demo project for a good start.
If you prefer to stick in the Java ecosystem, you might want to try Cucumber-jvm or JBehave.
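Either way, the core of such a check is simple enough to sketch in plain JUnit. In this rough example the folder locations, timeout, and file names are placeholders for your application's actual setup:

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.junit.Test;

public class FormatConversionRegressionTest {

    private static final Path INPUT_DIR = Paths.get("/app/input");    // watched by the application
    private static final Path OUTPUT_DIR = Paths.get("/app/output");  // written by the application
    private static final Path FIXTURES = Paths.get("src/test/resources/formatA");

    @Test
    public void convertsFormatAAsExpected() throws Exception {
        // 1. Drop a known input file into the folder the application watches.
        Files.copy(FIXTURES.resolve("sample-input.txt"), INPUT_DIR.resolve("sample-input.txt"));

        // 2. Wait (with a timeout) for the application to produce its output file.
        Path produced = OUTPUT_DIR.resolve("sample-input.converted.txt");
        long deadline = System.currentTimeMillis() + 60_000;
        while (!Files.exists(produced) && System.currentTimeMillis() < deadline) {
            Thread.sleep(500);
        }
        assertTrue("Application did not produce output in time", Files.exists(produced));

        // 3. Compare the produced file with the expected ("golden") file, line by line.
        List<String> expected = Files.readAllLines(FIXTURES.resolve("sample-expected.txt"));
        List<String> actual = Files.readAllLines(produced);
        assertEquals(expected, actual);
    }
}
```

The database checks would be a separate step (for example plain JDBC queries), and each new format then just becomes another pair of input/golden files.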

Command line utility to run code in a web app

I just got a requirement to create a small (I assume standalone) utility to hit some code in our web application to do some custom processing of files from the app and then dump the files into a shared drive. My question is what is the best way for doing this? Do I just create a small app and then jar it up and run it off a command line or is there a better way?
Sorry, I didn't give enough detail. It's an old application, over 10 years old, so while it's been upgraded to JDK 1.6, most of the code uses the old collections, old loops, etc. There aren't any interfaces, and the code is very tightly coupled, using inheritance with lots of nested objects. The web app will do the processing.
I think what they want is some code outside of the application code that will log in and then fire off the file-processing code. Prior to this I had upgraded their version of Windward Reports in a separate branch, and they want to make sure that the processed files (contracts, forms, etc.) don't get altered greatly, as there are legal requirements on fonts and layouts. So this utility will go in, fire off the list of reports (a few thousand), and dump them to a share drive so they can be viewed en masse with another tool for comparison, based on rules you can automate with that commercial tool.
I was thinking of creating a small class with a main method, jarring it up, and then, while the web server is running with my upgraded branch code, running the utility off the command line to fire it off.
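Roughly what I have in mind is something like the sketch below; the URLs, request parameters, and login handling are all hypothetical and would have to match whatever the app actually expects:

```java
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class ReportRunner {

    public static void main(String[] args) throws Exception {
        // Keep the session cookie between the login call and the report calls.
        CookieHandler.setDefault(new CookieManager());

        // 1. Log in (form parameters are made up - use whatever the app expects).
        String loginParams = "username=" + URLEncoder.encode(args[0], "UTF-8")
                + "&password=" + URLEncoder.encode(args[1], "UTF-8");
        post("http://localhost:8080/app/login", loginParams);

        // 2. Fire off each report; the app is assumed to write the results to the share itself.
        for (int reportId = 1; reportId <= 2000; reportId++) {
            post("http://localhost:8080/app/reports/run", "reportId=" + reportId);
        }
    }

    private static void post(String url, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.getOutputStream().write(body.getBytes("UTF-8"));
        int status = conn.getResponseCode();
        if (status >= 400) {
            throw new IllegalStateException("Request to " + url + " failed with HTTP " + status);
        }
        try (InputStream in = conn.getInputStream()) {
            while (in.read() != -1) { /* drain the response */ }
        }
        conn.disconnect();
    }
}
```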
There's not enough to go on here. How are the web app's functions exposed? If it's a REST interface then wget/curl/spring-rest-template are the way to go. If it's something like a JSF app then you're going to need something like Selenium to imitate a browser. If the functionality is in a shared library (JAR) then the web never even comes into play.
Well, I was originally looking at creating a standalone utility JAR that I would run off the command line to connect to the app with URLConnection, but I found there is already testing code built into the application that I can run from the command line as long as I deploy the new code with the existing code. The utility will dump out the files to a shared drive, and then XTest can be run to compare the files. After reviewing the capabilities of XTest, it appears that it can handle the comparison of files well.

Patching Java software

I'm trying to create a process to patch our current java application so users only need to download the diffs rather than the entire application. I don't think I need to go as low level as a binary diff since most of the jar files are small, so replacing an entire jar file wouldn't be that big of a deal (maybe 5MB at most).
Are there standard tools for determining which files changed and generating a patch for them? I've seen tools like xdelta and vpatch, but I think they work at a binary level.
I basically want to figure out - which files need to be added, replaced or removed. When I run the patch, it will check the current version of the software (from a registry setting) and ensure the patch is for the correct version. If it is, it will then make the necessary changes. It doesn't sound like this would be too difficult to implement on my own, but I was wondering if other people had already done this. I'm using NSIS as my installer if that makes any difference.
Thanks,
Jeff
Be careful when doing this--I recommend not doing it at all.
The biggest problem is public static final constants. Their values are actually compiled into the classes that use them, not referenced at runtime. This means that even if a Java file doesn't change, its class must be recompiled, or it will still refer to the old value.
You also want to be very careful about changing method signatures: you will get some very subtle bugs if you change a method signature and do not recompile all the files that call that method, even if the calling Java files don't actually need to change (for instance, if you change a parameter from an int to a long).
If you decide to go down this path, be ready for some really hard-to-debug errors at customer sites that you cannot reproduce (generally no stack traces or other clear indications, just strange behavior like the number received not matching the one sent) and a lot of pissed-off customers.
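A tiny example of the constant-inlining pitfall described above (class names are made up):

```java
// Config.java - shipped in config.jar
public class Config {
    public static final int MAX_CONNECTIONS = 10;
}

// Client.java - shipped in client.jar
public class Client {
    public static void main(String[] args) {
        // The compiler inlines the literal 10 here at compile time.
        // If you later patch config.jar with MAX_CONNECTIONS = 20 but do not
        // recompile and re-ship Client, this still prints 10.
        System.out.println(Config.MAX_CONNECTIONS);
    }
}
```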
Edit (too long for comment):
A binary diff of the class files might work, but I'd assume that some kind of version number or date gets compiled in and that the files would change a little on every compile for no reason; that could be easily tested, though.
You could adopt some strict development practices, such as not using public final statics (make them private) and never changing method signatures (deprecate instead), but I'm not convinced that I know all the possible problems; I just know the ones we encountered.
Also, binary diffs of the JAR files would be useless; you'd have to diff the classes and re-integrate them into the JARs (which doesn't sound easy to track).
Can you package your resources separately then minimize your code a bit? Pull out strings (Good for i18n)--I guess I'm just wondering if you could trim the class files enough to always do a full build/ship.
On the other hand, Sun seems to do an okay job of making class files that are completely compatible with the previous JRE release, so they must have guidelines somewhere.
You may want to see if Java WebStart can help you as it is designed to do exactly those things you want to do.
I know that the documentation describes how to create and do incremental updates, but we deploy the whole application as it changes very rarely. It is then an issue of updating the JNLP when ready.
How is it deployed?
On a local network I just leave everything as .class files in a folder. The startup script uses robocopy or rsync to copy from network share to local. If any .class file is different it is synced down. If not, it doesn't sync.
For a non-local network I created my own updater. It downloads a text file of md5sums and compares them to the local files. If a file differs, it pulls it down over HTTP.
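A stripped-down sketch of that updater idea; the manifest format, base URL, and install directory are made up, and each manifest line is assumed to be "md5sum filename":

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.util.List;

public class SimpleUpdater {

    private static final String BASE_URL = "http://updates.example.com/myapp/";
    private static final Path INSTALL_DIR = Paths.get("app");

    public static void main(String[] args) throws Exception {
        // The manifest lists one "md5sum filename" pair per line.
        List<String> manifest = Files.readAllLines(download("md5sums.txt"));
        for (String line : manifest) {
            String[] parts = line.trim().split("\\s+", 2);
            String expectedMd5 = parts[0];
            String fileName = parts[1];

            Path local = INSTALL_DIR.resolve(fileName);
            if (!Files.exists(local) || !expectedMd5.equalsIgnoreCase(md5Of(local))) {
                System.out.println("Updating " + fileName);
                Files.createDirectories(local.getParent());
                Files.copy(download(fileName), local, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    // Downloads a file from the update server into a temp file and returns its path.
    private static Path download(String fileName) throws Exception {
        Path tmp = Files.createTempFile("update-", ".tmp");
        try (InputStream in = new URL(BASE_URL + fileName).openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    private static String md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```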
A long time ago the way we solved this was to use the classpath and JAR files. Our application was built as a JAR file, and it had a launcher JAR file. The launcher classpath had a patch.jar that was read into the classpath before the main application.jar. This meant that we could update patch.jar to supersede any classes in the main application.
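That setup boils down to a launcher JAR whose manifest lists patch.jar ahead of the main JAR, since classpath entries are searched in order (the names here are illustrative):

```
Main-Class: com.example.Launcher
Class-Path: patch.jar application.jar
```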
However, this was a long time ago. You may be better using something like the Java Web Start type of approach, which offers more seamless application updating.
