Analyzing big text files in Java

I need to analyze a log file at runtime with Java.
What I need is to be able to take a big text file and search for a certain string or regex within a certain range of lines.
The range itself is deduced by another search.
For example, I want to search the string "operation ended with failure" in the file, but not the whole file, only starting with the line which says "starting operation".
Of course I can do this with plain InputStream and file reading, but is there a library or a tool that will help do it more conveniently?

If the file is really huge, then a well-written Java solution and any *nix tool will be almost equally slow, since the work is I/O-bound; in that case you won't avoid reading the whole file line by line, and a few lines of Java code would do the job (see the sketch below). Rather than a one-off search, though, I'd think about splitting the file at generation time, which might be much more efficient. You could redirect the log output to another program or script (awk or Python would be perfect for it) and split it as it is generated rather than after the fact.
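For what it's worth, here is a minimal sketch of that line-by-line approach, with the two marker strings from the question hard-coded and the file path taken from the command line; adjust both to taste:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class LogRangeSearch {
    public static void main(String[] args) throws IOException {
        Pattern failure = Pattern.compile("operation ended with failure");
        boolean inRange = false;
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!inRange) {
                    inRange = line.contains("starting operation"); // the range starts here
                } else if (failure.matcher(line).find()) {
                    System.out.println("match: " + line);
                }
            }
        }
    }
}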

Check this one out - http://johannburkard.de/software/stringsearch/
Hope that helps ;)


Extract part of XML file [duplicate]

I need an XML parser to parse a file that is approximately 1.8 GB,
so the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
Use a SAX based parser that presents you with the contents of the document in a stream of events.
The StAX API is easier to deal with than SAX. Here is a short tutorial.
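For instance, a minimal StAX pull-parsing sketch; the element name "entry" is just a placeholder for whatever element you actually want to pull out:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExtract {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "entry".equals(reader.getLocalName())) {
                    // getElementText() assumes the element contains only text
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }
}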
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then store it somewhere else on the fly (a database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on, and you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called at the start and end of each XML element. From there you can keep track of the current xpath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
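A rough sketch of such a handler; the path /catalog/item/title is only an example of a path you might care about, not anything from the question:

import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // may be called several times within one element, so accumulate
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String current = "/" + String.join("/", path);
        if (current.equals("/catalog/item/title")) {
            System.out.println(text); // save, hand off, or otherwise process the chunk here
        }
        path.removeLast();
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File(args[0]), new PathTrackingHandler());
    }
}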
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem: I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text, which I had to output to my output file but which wasn't important for the algorithm.
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text because of its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, the machine will swap. Doing a mark-and-sweep garbage collection then results in pages being accessed in random order and objects being moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the filesystem can obviously handle a sequential write of 3 GB just fine, and when reading it back the OS will use available memory as a file-system cache; there may still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and as far as I know it has no limit on the size of the files it can process.

NTFS Listing All Files and Directories

I'm trying to make a list of all the files and folders on a mounted NTFS volume, and I've made two ways to do it so far, both yielding different results (unfortunately).
(NOTE: I couldn't include additional sources here because of the link limit.)
There are a few things I would like cleared up:
(1) How come certain files/folders have weird unrecognizable characters in the middle of the name? How do I print them to a wstringstream, and how would I then properly write them to a wofstream?
Example file path: C:\Users\Rahul\AppData\Local\Packages\winstore_cw5n1h2txyewy\LocalState\Cache\4\4-https∺∯∯wscont.apps.microsoft.com∯winstore∯6.3.0.1∯100∯US∯en-us∯MS∯482∯features1908650c-22a4-485e-8e88-b12d01c84f2f.json.dat
How it appears if you were to use dir in cmd: C:\Users\Rahul\AppData\Local\Packages\winstore_cw5n1h2txyewy\LocalState\Cache\4\4-https???wscont.apps.microsoft.com?winstore?6.3.0.1?100?US?en-us?MS?482?features1908650c-22a4-485e-8e88-b12d01c84f2f.json.dat
How it appears if you were to use wprintf in C++: C:\Users\Rahul\AppData\Local\Packages\winstore_cw5n1h2txyewy\LocalState\Cache\4\4-https
The file name shows properly in Windows Explorer, but has trouble being printed in cmd. It appears as a box in Notepad++, but if you right-click it, it shows properly, so Notepad++ can also display the characters (sort of; maybe an encoding change?).
I'm currently using the following (ss is the stringstream, initialized as wstringstream ss("");):
wstringstream ss("");
(my program methods here)
wofstream out("...", wofstream::out);
out << ss.rdbuf();
out.close();
I'm assuming that the encoding has at least something to do with it, but at the same time, I'm not sure which flags to use.
(2) Are all files listed in the MFT?
Every link on NTFS says that all file information and attributes are stored in the MFT, but according to the open-source NTFSLib (link limit again; it can be found by googling An-NTFS-Parser-Lib), there are 131840 file records.
When I run my own program, I end up with a 50MB file (it includes permissions and such). My program uses FSCTL_MFT_ENUM_USN_DATA, CreateFile for handles, and GetFileInformationByHandle for getting extended information.
CreateFile takes the WCHAR* normally and doesn't have the weird null-termination issues (I think; I'm not even sure anymore, this might be where the missing files are).
It shows 129454 files that it could read; I'm assuming the other 131840 - 129454 = 2386 files are files that were deleted but are still in the USN journal.
(3) How come my Java version of the code outputs more file records than the MFT even contains?
The output of my Java code is a 150MB file (it includes permissions and enumerates them with names instead of symbols, because I don't know how to avoid that, so it's way bigger).
As you can see here, there are 161430 file records in this one. That's more than NTFSLib said there are. Yes, many of those 131840 file records probably have 'additional names', but I explicitly avoided symlinks in my Java version. Are those extra 30000 files generated from hard links, or is having more names somehow independent of being a symlink?
Solution to (1):
You must write your own routine that can write UTF-16, since a naive write will sometimes hit cases where the bytes are misaligned and a 0x00 byte is mistaken for a null terminator. For example:
0xD00A may run into the 0x00 character during a misalignment and thus terminate the write early.
I used the following two files to write out as unicode. Handles wchar_t, wchar_t*, char, char*, unsigned long, and unsigned long long:
UTF16.h,
UTF16.c
(2,3):
Yes, they're all there. You can find the number of links via GetFileInformationByHandle, and this will count up to the number of files that the Java version contains.
Still looking for: How do you list the names of all the links to the file record in the MFT?

Java, Query a series of .HTML Help

I'm trying to figure out what is necessary to perform what I believe to be a somewhat simple task, but it seems its execution is a bit advanced.
Can someone provide an example that might help me figure out the following goal?
Check various known .html files on a local server for a string
If the string Que_for_board is found, then parse other strings that will be in the file
Example: Release data, Author, Program, etc.
Else (if Que_for_board is not found) go to the next HTML file
Take the results and print them to a file
Is this as hard as it seems? I've looked into the HTMLCleaner parser, but I'm not sure whether I need to clean up the HTML into XML, and I'm finding it hard to find example code that covers the querying step in detail.
I would not describe this as a truly hard task, in that it's really a matter of using a number of techniques that are "out there", but I can see that it could be intimidating.
A technique I find useful is to decompose the overall task into small problems, and teach myself to think only about one problem at a time, having faith that I can assemble the overall solution eventually.
So here you have perhaps
get a list of files from somewhere (where? directory listing, a document?)
Open each file in a list in turn
parse an html file
find a specific string in the parsed file
Of these, parsing HTML files is potentially very easy or not so easy. Can you trust the files to be well-formed, or were they written by a human? Humans simply don't make good HTML files, and browsers are very tolerant of missing </P> tags and the like.
If these are very simple HTML files you can fake this with simple String searching, regex and so on. Otherwise you need a proper parser, and maybe a clean-up pass first.
My first step would be to understand how to process a single HTML file.
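As a starting point, here is a rough sketch that treats each HTML file as plain text, checks for the marker string, and pulls out one labelled field with a regex. The "Author:" label and the report.txt output name are assumptions about how the pages look, not something from the question:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoardQueueScan {
    private static final Pattern AUTHOR = Pattern.compile("Author:\\s*([^<\\r\\n]+)");

    public static void main(String[] args) throws IOException {
        StringBuilder report = new StringBuilder();
        for (String name : args) {                        // the known .html files
            String html = new String(Files.readAllBytes(Paths.get(name)), StandardCharsets.UTF_8);
            if (!html.contains("Que_for_board")) {
                continue;                                 // marker not found, go to the next file
            }
            Matcher m = AUTHOR.matcher(html);
            if (m.find()) {
                report.append(name).append(": ").append(m.group(1).trim()).append('\n');
            }
        }
        Files.write(Paths.get("report.txt"), report.toString().getBytes(StandardCharsets.UTF_8));
    }
}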

Java: Where can I find advanced file manipulation source/libraries?

I'm writing arbitrary byte arrays (mock virus signatures of 32 bytes) into arbitrary files, and I need code to overwrite a specific file given an offset into the file. My specific question is: is there source code or a library that I can use to perform this particular task?
I've had this problem with Python file manipulation as well. I'm looking for a set of functions that can kill a line, cut/copy/paste, etc. My assumption is that these are extremely common tasks, yet I couldn't find them in the Java API or in my Google searches.
Sorry for not RTFM well; I haven't come across any information, and I've been looking for a while now.
Maybe you are looking for something like the RandomAccessFile class in the standard Java JDK. It supports reads and writes at some offset, as well as byte arrays.
Java's RandomAccessFile is exactly what you want.
It includes methods like seek(long) that allow you to move wherever you need in the file. It also allows for reading and writing at the same time.
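For instance, a minimal sketch of overwriting 32 bytes at an offset and reading them back; the file name and offset are just placeholder values:

import java.io.RandomAccessFile;

public class PatchAtOffset {
    public static void main(String[] args) throws Exception {
        byte[] signature = new byte[32];                 // the 32-byte mock signature
        try (RandomAccessFile raf = new RandomAccessFile("target.bin", "rw")) {
            raf.seek(1024);                              // jump to the offset
            raf.write(signature);                        // overwrite 32 bytes in place
            raf.seek(1024);
            byte[] check = new byte[32];
            raf.readFully(check);                        // read the same region back
        }
    }
}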
As far as I know, Java has primarily lower-level functions for manipulating files directly. Here is the best I've come up with.
The actions you describe are standard in the Swing world, and for text they come down to manipulating a Document object. Those act on data in memory. The class java.nio.channels.FileChannel has similar methods that act directly on a file. Neither finds the end of lines automatically, but other classes in java.io and java.nio do.
Apache Commons has a sandbox library called Flatfile which looks like it does what you want. The problem is that no code has been released yet. You may, however, want to talk to people working on it to get some more ideas. I didn't do a general check on libraries.
Have you looked into File/FileReader/FileWriter/BufferedReader? You can get the contents of the files and manipulate them as you like; you can search the data in the files, overwrite files, create new ones, or append to an existing one.
I am not sure this is exactly what you are asking for but I use these APIs all the time for logging, RTF editors, text file creation for email, and many other things.
As far as cut/copy/paste goes, I have not come across the ability to do that directly; however, you can output the contents of the file, "copy" the part of it you want, and "paste" it into a new file or append it to an existing one.
While writing a byte array to a file is a common task, writing a 32-byte array to a given file at an offset, just once, is not something you are going to find ready-made in java.io :)
To get started, would the method below look reasonable to you? I bet someone here, maybe even myself, could whip it up quickly.
// uses java.io.File, java.io.IOException and java.io.RandomAccessFile
public static void writeFauxVirusSignature(File file, byte[] bytes, long offset) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) { // open file
        raf.seek(offset);   // move to offset
        raf.write(bytes);   // write bytes
    }                       // close file (handled by try-with-resources)
}
Questions:
How big could the potential target files be?
Do you need performance?
I ask because clean, easy-to-read code would use the Apache Commons libraries, but large file writes in a performance-sensitive environment will necessitate using the java.nio libraries.
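If you do go the java.nio route, here is a rough sketch of a positional write with FileChannel; again, the path and offset are placeholder values:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioPatchAtOffset {
    public static void main(String[] args) throws Exception {
        byte[] signature = new byte[32];                       // the 32-byte mock signature
        try (FileChannel channel = FileChannel.open(Paths.get("target.bin"),
                StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(signature), 4096L);  // write at an absolute position
        }
    }
}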

Palm Database (PDB) files in Java?

Has anybody written any classes for reading and writing Palm Database (PDB) files in Java? (I mean on a server, not on the Palm device itself.) I tried to google, but all I got were Protein Data Bank references.
I wrote a Perl program that does it using Palm::PDB.pm, but I want to turn it into a servlet for a GWT app.
The jSyncManager project at http://www.jsyncmanager.org/ is under the LGPL and includes classes to read and write PDB files -- look in jSyncManager/API/Protocol/Util/DLPDatabase.java in its source code. It looks like the core code you need from this could be isolated from the rest of the library with a little effort.
There are a few ways that you can go about this:
Easiest but slowest: find a Perl-to-Java bridge. This will not be quick, but it will work and it should involve the least amount of work.
Find a C++/C# implementation that you have the source to and convert it (this should be the fastest solution).
Find a Java reader... there seem to be a few listed on Google; however, I do not have any experience with them.
Depending on what your intended usage is, you might look into writing a simple reader yourself. The format is pretty simple and you only need to handle a couple of simple fields to parse it.
Basically there is a header for the entire file which has a 2-byte integer at the end that specifies the number of records. So just skip your way through the bytes of all the other header fields and then read the last field, which is the number of records in the file. Be aware that the PDB format writes integers with the most significant byte first.
Following this, there will be a record header for each record, the first field of which is the actual offset into the file for the record itself. Again, be aware of the byte order.
So, now you have the offsets into the file for each record in the file, which should make it very easy to read the actual records as long as you know the format of these for the type of PDB file you are trying to read.
Wikipedia has a nice overview of the header formats.
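Putting that together, here is a rough Java sketch of reading the record index. It assumes the standard 78-byte PDB header with the record count in the last two bytes and 8-byte record-list entries (a 4-byte offset plus 4 bytes of attributes/unique ID), so check those offsets against the Wikipedia description before relying on them:

import java.io.DataInputStream;
import java.io.FileInputStream;

public class PdbIndexDump {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            in.skipBytes(76);                          // assumed fixed header fields before the count
            int numRecords = in.readUnsignedShort();   // DataInputStream reads MSB first, as PDB expects
            System.out.println(numRecords + " records");
            for (int i = 0; i < numRecords; i++) {
                int offset = in.readInt();             // offset of this record in the file
                in.skipBytes(4);                       // attribute byte + 3-byte unique ID (assumed)
                System.out.println("record " + i + " starts at offset " + offset);
            }
        }
    }
}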
Maybe JPilot can help? They must have a lot of Java code dealing with Palm OS data.
