Processing a large file using Java

Updated:
My requirement is to process files with huge content. I need to apply multiple business rules to the file content. A business rule may apply to the whole content of the file; for example, based on the status of a column, a record becomes eligible for a business rule, and the outcome of one business rule may make it eligible for another.
Another requirement is to apply quality checks to the incoming data in the form of files. In many cases, I see that I have to hold large content in memory for processing.
I was going through this article, which explains how to process a large file using the java.nio package. I found it very interesting and thought of trying the code.
Unfortunately, the code is not executable. Can somebody help by sharing/making an executable version of this code? Clues on how to make it executable are also welcome.
Issues I found are:
The method closeQuietly(InputStream) in the type Closeables is not applicable for the arguments (FileChannel)
Could not figure out what the Timestamped implementation should be (the blog claims this file is not shown there)!
TrueFxDecoder and TrueFxData are missing!! A dummy implementation reference would be of great help.
Libraries used: JavaSE-1.7, Guava-17.0.jar
I believe this executable code would definitely be useful to many other people with this kind of requirement.
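The missing pieces can be stubbed out. Since the blog never shows Timestamped, TrueFxData or TrueFxDecoder, here is a minimal hypothetical stand-in (the field layout is a guess based on TrueFX tick data: symbol, timestamp, bid, ask) that should at least let the rest of the code compile:

    // Hypothetical stand-ins for the blog's missing types; the field
    // layout is a guess, not the blog author's actual code.
    interface Timestamped {
        long timestamp();
    }

    class TrueFxData implements Timestamped {
        final String symbol;
        final long timestamp;
        final double bid, ask;

        TrueFxData(String symbol, long timestamp, double bid, double ask) {
            this.symbol = symbol;
            this.timestamp = timestamp;
            this.bid = bid;
            this.ask = ask;
        }

        @Override
        public long timestamp() { return timestamp; }
    }

    class TrueFxDecoder {
        // Decodes one CSV-like line such as "EUR/USD,1398123456789,1.3621,1.3624".
        TrueFxData decode(String line) {
            String[] parts = line.split(",");
            return new TrueFxData(parts[0], Long.parseLong(parts[1]),
                    Double.parseDouble(parts[2]), Double.parseDouble(parts[3]));
        }
    }

As for the closeQuietly error: in Guava 17.0, Closeables.closeQuietly only accepts an InputStream or Reader, so the simplest fix is to open the FileChannel in a try-with-resources statement and drop the closeQuietly call entirely.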

In my experience, BufferedInputStream and BufferedOutputStream are great for processing large files: these streams don't load the whole file into memory; they use an internal buffer. The tutorial you have chosen is quite complex and confusing.
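For example, a minimal sketch of that buffered, record-at-a-time style (the file name and the process method are placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class LargeFileProcessor {
        public static void main(String[] args) throws IOException {
            // The buffer holds only a small window of the file at a time,
            // so memory use stays flat regardless of file size.
            try (BufferedReader reader = new BufferedReader(new FileReader("big-input.csv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    process(line); // apply business rules / quality checks per record
                }
            }
        }

        private static void process(String line) {
            // placeholder for rule evaluation
        }
    }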

Related

How to handle multiple source files during semantic analysis?

I'm trying to implement an external DSL using a handcrafted compiler. I'm done with lexing and parsing; however, I'm currently lost with regard to resolving symbols across separate files (e.g. inheritance).
I've tried searching for this, but nothing comes up about handling it at the compiler level. I have stumbled upon object files, linkers and loaders, but on further research they seem to play a role after compilation rather than during it.
Thank you to anyone who could help.
This is heavily dependent on the nature of your language.
If you're C-like, you have header files defining the shared symbols, and you #include the header every place it is used (recompiling the header as part of that file).
If you're Java-like, you have a standard naming convention and package/directory hierarchy to allow you to locate a symbol.
If you're JavaScript-like, you don't resolve any symbols at compile time; you just throw an error if a symbol is not defined when it is used. (For small scripting languages this is often the simplest answer.)
If the total amount of code in the DSL is small, another option is just to load and parse it all and then do the symbol-resolving pass on the whole thing at once, as in the sketch below.
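A minimal sketch of that whole-program, two-pass approach; the SourceFile shape is a hypothetical stand-in for whatever your parser actually produces:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical AST summary standing in for your real parse result.
    class SourceFile {
        final List<String> declaredTypes;   // e.g. class names declared in this file
        final List<String> referencedTypes; // e.g. superclasses named in this file

        SourceFile(List<String> declared, List<String> referenced) {
            this.declaredTypes = declared;
            this.referencedTypes = referenced;
        }
    }

    class Resolver {
        // Pass 1: collect every declaration from every file into one table.
        // Pass 2: resolve references against that table, so the order in
        // which files were parsed no longer matters.
        static void resolve(List<SourceFile> files) {
            Map<String, SourceFile> symbols = new HashMap<>();
            for (SourceFile f : files) {
                for (String t : f.declaredTypes) {
                    symbols.put(t, f);
                }
            }
            for (SourceFile f : files) {
                for (String ref : f.referencedTypes) {
                    if (!symbols.containsKey(ref)) {
                        throw new IllegalStateException("Unresolved symbol: " + ref);
                    }
                }
            }
        }
    }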

Using JCR for a large amount of configuration files

My program uses a lot of small serializable configuration files that are loaded when the program starts (around 10,000 1-2K binary files).
The configuration files are stored in a zip file that is backed up on a remote machine.
The background:
When the program starts, it unzips new content from the remote machine, if any exists.
Sometimes, when a lot of new content is available, loading can take a minute or two.
I've profiled the program flow with JVisualVM and found that most of the time is spent on IO (unzipping, loading the serialized files, ...).
I have a few ideas for working with the zip without unzipping it, and for cutting unneeded metadata. With all those changes, my tests yielded loading times of 20-30 seconds, which is OK.
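Roughly what I mean by working with the zip in place, using java.util.zip to read the serialized entries directly (registerConfig is a placeholder I haven't written yet):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.ObjectInputStream;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ConfigLoader {
        // Reads each serialized config object straight out of the archive,
        // so nothing is ever extracted to disk.
        static void loadAll(String zipPath) throws IOException, ClassNotFoundException {
            try (ZipFile zip = new ZipFile(zipPath)) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    if (entry.isDirectory()) continue;
                    try (InputStream in = zip.getInputStream(entry);
                         ObjectInputStream oin = new ObjectInputStream(in)) {
                        Object config = oin.readObject();
                        // registerConfig(entry.getName(), config);
                    }
                }
            }
        }
    }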
The question: Recently I read about JCR, which sounds like a reasonable solution for my situation. On one hand, I prefer using an accepted, widely-known solution like JCR over a custom implementation of my own. On the other hand, I'm worried that JCR implementations won't be as efficient as my custom implementation (which isn't complete yet).
Are there any recommendations for JCR implementations that may be efficient for such a situation? I'd love to hear your opinion.
Thanks in advance.

Using Java to test for file damage and corruption

I am looking at writing a program that can test files for corruption and/or damage. I would prefer to write the program in Java.
Now for the tricky part: is it possible to use Java to test for file corruption/damage in many different file types? I am mainly looking at checking .pdf, .html and .txt files, but I fear that more types could be added to the list soon. I honestly have no idea if this is even possible to write or not. If Java cannot do this, is it possible to do it with C?
I think you are going to have to take it on a file-by-file basis. For example:
text files - make sure that you can read the file using FileReader
html - make sure it is a text file AND that the HTML file is valid
pdf - use a PDF library to see if you can read the PDF and that it is valid
But as alex has suggested, it doesn't matter if you do this in Java: as long as you can read bytes, you can check.
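For instance, a minimal readability check for text files; Files.newBufferedReader reports malformed bytes as an IOException, so reading the whole file doubles as an encoding-validity check:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class TextFileChecker {
        // Returns false if the file cannot be opened, cannot be read
        // to the end, or contains bytes that are not valid UTF-8.
        static boolean isValidUtf8Text(Path file) {
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                while (reader.readLine() != null) {
                    // reading to the end is the whole test
                }
                return true;
            } catch (IOException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println(isValidUtf8Text(Paths.get("example.txt")));
        }
    }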
You also have to define corruption. If by corruption you mean verifying the disk blocks on the HD themselves, then you might need a lower-level programming language. If you mean that all the bytes represent correct data, then you can do this in any language.
You will first need to define "corruption". If you can assume that a file is in good shape as long as you can open it, read its content, confirm its file permissions, and confirm that it is not empty, that's doable in Java via the java.io API.
If your definition of a valid file includes more rules, such as HTML files needing to be in valid XML form, and PDFs needing to be correct/complete, then your program will get more interesting, depending on your requirements. For PDFs, you can use iText to read them and get their metadata:
http://itextpdf.com/
Files can always be seen as a collection of bytes that Java can read. If you have an algorithm to check for corruption, nothing prevents you from implementing it in Java.
And using some good design patterns can make it easy to support different file types.
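For instance, if a known-good checksum can be stored alongside each file, corruption detection reduces to a digest comparison; a minimal sketch using the standard MessageDigest API:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class ChecksumVerifier {
        // Streams the file through SHA-256; any flipped byte changes the digest.
        static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    digest.update(buffer, 0, read);
                }
            }
            return digest.digest();
        }

        static boolean matches(Path file, byte[] expected)
                throws IOException, NoSuchAlgorithmException {
            return MessageDigest.isEqual(sha256(file), expected);
        }
    }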
Acrobat has some fairly powerful repair capabilities, so it repairs and opens many broken files. The spec is also quite loosely interpreted (for example, TT fonts are supposed to be MAC encoded, but in practice WIN encoding works).

Automatically generating Java source code

I'm looking for a way to automatically generate source code for new methods within an existing Java source code file, based on the fields defined within the class.
In essence, I'm looking to execute the following steps:
Read and parse SomeClass.java
Iterate through all fields defined in the source code
Add source code method someMethod()
Save SomeClass.java (Ideally, preserving the formatting of the existing code)
What tools and techniques are best suited to accomplish this?
EDIT
I don't want to generate code at runtime; I want to augment existing Java source code
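For the concrete parse/iterate/add/save loop, here is a minimal sketch using the open-source JavaParser library (com.github.javaparser; not mentioned in the answers below, and the generated someMethod() body is just an illustration). Its LexicalPreservingPrinter keeps the formatting of the code you don't touch:

    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.CompilationUnit;
    import com.github.javaparser.ast.Modifier;
    import com.github.javaparser.ast.body.ClassOrInterfaceDeclaration;
    import com.github.javaparser.ast.body.FieldDeclaration;
    import com.github.javaparser.ast.body.MethodDeclaration;
    import com.github.javaparser.printer.lexicalpreservation.LexicalPreservingPrinter;

    public class MethodInjector {
        public static void main(String[] args) throws IOException {
            // Parse the existing source; LexicalPreservingPrinter preserves
            // the original formatting of everything we don't modify.
            CompilationUnit cu = StaticJavaParser.parse(new File("SomeClass.java"));
            LexicalPreservingPrinter.setup(cu);

            ClassOrInterfaceDeclaration cls = cu.getClassByName("SomeClass")
                    .orElseThrow(() -> new IllegalStateException("class not found"));

            // Iterate the declared fields to drive the generated method body.
            StringBuilder body = new StringBuilder("{ return \"\"");
            for (FieldDeclaration field : cls.getFields()) {
                String name = field.getVariable(0).getNameAsString();
                body.append(" + \"").append(name).append("=\" + ").append(name);
            }
            body.append("; }");

            MethodDeclaration method = cls.addMethod("someMethod", Modifier.Keyword.PUBLIC);
            method.setType("String");
            method.setBody(StaticJavaParser.parseBlock(body.toString()));

            Files.write(Paths.get("SomeClass.java"),
                    LexicalPreservingPrinter.print(cu).getBytes(StandardCharsets.UTF_8));
        }
    }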
What you want is a Program Transformation system.
Good ones have parsers for the language you care about, build ASTs representing the program for the parsed code, provide you with access to the AST for analysis and modification, and can regenerate source text from the AST. Your remark about "scanning the fields" is just a kind of traversal of the AST representing the program. For each interesting analysis result you produce, you want to make a change to the AST, perhaps somewhere else, but nonetheless in the AST.
And after all the changes are made, you want to regenerate text with comments (as originally entered, or as constructed in your new code).
There are several tools that do this specifically for Java.
Jackpot provides a parser, builds ASTs, and lets you code Java procedures to do what you want with the trees. Upside: easy conceptually. Downside: you write a lot more Java code to climb around/hack at trees than you'd expect. Jackpot only works with Java.
Stratego and TXL parse your code, build ASTs, and let you write "source-to-source" transformations (using the syntax of the target language, e.g., Java in this case) to express patterns and fixes. Additional good news: you can define any programming language you like as the target language to be processed, and both of these have Java definitions.
But they are weak on analysis: often you need symbol tables and data flow analysis to really make the analyses and changes you need. And they insist that everything is a rewrite rule, whether that helps you or not; this is a little like insisting you only need a hammer in the toolbox; after all, everything can be treated like a nail, right?
Our DMS Software Reengineering Toolkit allows the definition of an arbitrary target language (and has many predefined languages, including Java), includes all the source-to-source transformation capabilities of Stratego and TXL, the procedural capability of Jackpot,
and additionally provides symbol tables, control and data flow analysis information. The compiler guys taught us these things were necessary to build strong compilers (= "analysis + optimizations + refinement") and it is true of code generation systems too, for exactly the same reasons. Using this approach you can generate code and optimize it to the extent you have the knowledge to do so. One example, similar to your serialization ideas, is to generate fast XML readers and writers for specified XML DTDs; we've done that with DMS for Java and COBOL.
DMS has been used to read/modify/write many kinds of source files. A nice example that will make the ideas clear can be found in this technical paper, which shows how to modify code to insert instrumentation probes: Branch Coverage Made Easy.
A simpler but more complete example of defining an arbitrary language and transformations to apply to it can be found at How to transform Algebra using the same ideas.
Have a look at Java Emitter Templates (JET). They allow you to create Java source files using a markup language. It is similar to how you can use a scripting language to emit HTML, except you emit compilable source code. The syntax for JET is very similar to JSP, so it isn't too tricky to pick up. However, this may be overkill for what you're trying to accomplish. Here are some resources if you decide to go down that path:
http://www.eclipse.org/articles/Article-JET/jet_tutorial1.html
http://www.ibm.com/developerworks/library/os-ecemf2
http://www.vogella.de/articles/EclipseJET/article.html
Modifying the same Java source file with auto-generated code is a maintenance nightmare. Consider generating a new class that extends your current class and adds the desired method. Use reflection to read the user-defined class, and create Velocity templates for the auto-generated classes. Then, for each user-defined class, generate its extending class (a bare-bones sketch follows below). Integrate the code-generation phase into your build lifecycle.
Or you may use 'bytecode enhancement' techniques to enhance the classes without having to modify the source code.
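A bare-bones sketch of the extending-class idea; plain string building stands in here for the Velocity template, and the "Ext" suffix is an arbitrary choice:

    import java.lang.reflect.Field;

    public class SubclassGenerator {
        // Emits source for a subclass that adds someMethod(), driven by
        // the fields that reflection reports on the user-defined class.
        static String generate(Class<?> userClass) {
            StringBuilder src = new StringBuilder();
            src.append("public class ").append(userClass.getSimpleName()).append("Ext")
               .append(" extends ").append(userClass.getName()).append(" {\n");
            src.append("    public String someMethod() {\n");
            src.append("        return \"fields:\"");
            for (Field field : userClass.getDeclaredFields()) {
                src.append(" + \" ").append(field.getName()).append("\"");
            }
            src.append(";\n    }\n}\n");
            return src.toString();
        }
    }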
Updates:
Mixed-in auto-generated code always poses the risk of someone modifying it in the future just to tweak a small behavior; it's then only a matter of the next build before those changes are lost.
You will have to rely solely on comments at the top of the auto-generated source to prevent developers from doing so.
Version control: let's say you update the template of someMethod(); now the version of every source file will be bumped, even though the updates are auto-generated. You will see redundant history.
You can use cglib to generate code at runtime.
Iterating through the fields and defining someMethod is a pretty vague problem statement, so it's hard to give you a very useful answer, but Eclipse's refactoring support provides some excellent tools. It'll generate constructors that initialize a selected set of the defined members, and it'll also define a toString method for you.
I don't know what other someMethod()'s you'd want to consider, but there's a start for you.
I'd be very wary of injecting generated code into files containing hand-written code. Hand-written code should be checked into revision control, but generated code should not be; the code generation should be done as part of the build process. You'd have to structure your build process so that for each file you make a temporary copy, inject the generated source code into it, and compile the result, without touching the original source file that the developers work on.
Antlr is a really great tool that can very easily be used to transform Java source code into Java source code.

How to efficiently manage files on a filesystem in Java?

I am creating a few JAX-WS endpoints, for which I want to save the received and sent messages for later inspection. To do this, I am planning to save the messages (XML files) into filesystem, in some sensible hierarchy. There will be hundreds, even thousands of files per day. I also need to store metadata for each file.
I am considering putting the metadata (just a couple of fields) into a database table, but the XML file content itself into files in a filesystem, in order not to bloat the database with content data (that is seldom read).
Is there some simple library that helps me with saving, loading, deleting etc. the files? It's not that tricky to implement it myself, but I wonder if there are existing solutions? Just a simple library that already provides easy access to the filesystem (preferably across different operating systems).
Or do I even need that, should I just go with raw/custom Java?
Is there some simple library that helps me in saving, loading, deleting etc. the files? It's not that tricky to implement it myself, but I wonder if there are existing solutions? Just a simple library that already provides easy access to the filesystem (preferably over different operating systems).
Java API
Well, if what you need to do is really simple, you should be able to achieve your goal with java.io.File (delete, check existence, read, write, etc.) and a few stream manipulations with FileInputStream and FileOutputStream.
You can also throw in Apache commons-io and its handy FileUtils for a few more utility functions.
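For instance, a minimal sketch of the save/load/delete trio with commons-io (the class name and the relative-path layout are placeholders):

    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import org.apache.commons.io.FileUtils;

    public class MessageStore {
        private final File root;

        public MessageStore(File root) {
            this.root = root;
        }

        // Saves one XML message under the hierarchy, e.g. "2014/01/15/msg-42.xml".
        public void save(String relativePath, String xml) throws IOException {
            File target = new File(root, relativePath);
            FileUtils.forceMkdir(target.getParentFile());
            FileUtils.writeStringToFile(target, xml, StandardCharsets.UTF_8);
        }

        public String load(String relativePath) throws IOException {
            return FileUtils.readFileToString(new File(root, relativePath), StandardCharsets.UTF_8);
        }

        public boolean delete(String relativePath) {
            return FileUtils.deleteQuietly(new File(root, relativePath));
        }
    }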
Java is independent of the OS. You just need to make sure you use File.separator (not File.pathSeparator, which separates entries in a path list), or use the constructor File(File parent, String child), so that you never hard-code the separator character.
The Java file API is relatively high-level to abstract over the differences between the many OSes. Most of the time it's sufficient. It has shortcomings only if you need a relatively OS-specific feature that is not in the API, e.g. checking the physical size of a file on the disk (not the logical size), security rights on *nix, free space/quota of the hard drive, etc.
Most OSes have an internal buffer for file writing/reading. Using FileOutputStream.write and FileOutputStream.flush ensures the data has been handed to the OS, but not necessarily written to the disk. The Java API also supports this low-level integration to manage these buffering issues (example here) for systems such as databases.
Also, both files and directories are abstracted by File, and you need to check with isDirectory. This can be confusing, for instance if you have a file x and a directory /x (I don't remember exactly how to handle this issue, but there is a way).
Web service
The web service can use either xs:base64Binary to pass the data, or use MTOM (Message Transmission Optimization Mechanism) if files are large.
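A minimal sketch of a JAX-WS endpoint with MTOM enabled (the service and method names are made up):

    import javax.activation.DataHandler;
    import javax.jws.WebMethod;
    import javax.jws.WebService;
    import javax.xml.bind.annotation.XmlMimeType;
    import javax.xml.ws.soap.MTOM;

    // @MTOM makes the runtime send large binary parts as attachments
    // instead of inlining them as base64 text in the SOAP body.
    @MTOM
    @WebService
    public class MessageArchiveService {

        @WebMethod
        public void upload(@XmlMimeType("application/octet-stream") DataHandler content) {
            // stream content.getInputStream() to the filesystem ...
        }
    }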
Transactions
Note that the database is transactional and the file system is not, so you might have to add a few checks in case operations fail and are retried.
You could go with a complicated design involving some form of distributed transaction (see this answer), or try to go with a simpler design that provides the level of robustness that you need. A possible design could be:
Update. If the user wants to overwrite a file, you actually create a new one. The level of indirection between the logical file name and the physical file is stored in the database. This way you never overwrite a physical file once it is written, which keeps rollback consistent.
Create. Same story when the user wants to create a file.
Delete. If the user wants to delete a file, you do it only in the database first. A periodic job polls the file system for files that are not listed in the database and removes them. This two-phase delete ensures that the delete operation can be rolled back (a sketch of the periodic job follows below).
This is not as robust as writing BLOBs in a real transactional database, but it provides some robustness. You could otherwise have a look at commons-transaction, but I feel like the project is dead (2007).
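A minimal sketch of the periodic job from the Delete step; the set of live file names is assumed to come from the metadata table:

    import java.io.File;
    import java.util.Set;

    // Periodic job matching the two-phase delete: the database rows are the
    // source of truth; any physical file no row points to is an orphan.
    public class OrphanSweeper {

        public void sweep(File storageRoot, Set<String> namesStillInDatabase) {
            File[] files = storageRoot.listFiles();
            if (files == null) return; // not a directory, or an IO error
            for (File file : files) {
                if (file.isFile() && !namesStillInDatabase.contains(file.getName())) {
                    // The logical delete already committed, so removing
                    // the physical file cannot break a rollback.
                    file.delete();
                }
            }
        }
    }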
There is DataNucleus, a Java persistence provider. It is a little too heavy for this case, but it supports the JPA and JDO Java standards with different datastores (RDBMS, object storage, XML, JSON, Excel, etc.). If the product is already using JPA or JDO, it might be worth considering DataNucleus, as saving data into different datastores should be transparent. I suppose DataNucleus supports splitting the data into several files, creating the sensible directory/file structure I wanted (in my question), but this is just a guess.
Support for XML and JSON seems to be experimental.
