Multiple File Type Parser

This is my first post here. I'm excited to finally take part.
I'm working on a project where I'm parsing obscure file types. I need to be able to parse Word files (which I've already done), .sbs, .day, .cmp, and more. All of these types can be opened and displayed in Notepad.
Since I'm so new to this, is there a generic library (or two) I can use to open all of these? If so, which library would it be?
What's the best practice in this sort of circumstance?
Thanks!

You could use the Apache Commons IO library. Its FileUtils class has several methods that take the file and, optionally, the file encoding.
If you just want to read text files into a String variable:
java.io.File file = new java.io.File("C:\\dir\\file.cmp");
String allWordAndLines = org.apache.commons.io.FileUtils.readFileToString(file);
If you want each line separately and store them in a collection:
java.util.List<String> lines = org.apache.commons.io.FileUtils.readLines(file);
for(String line : lines) {
// do something with line
}
To specify the encoding, you need to add another parameter:
org.apache.commons.io.FileUtils.readFileToString(file, "UTF-8");
org.apache.commons.io.FileUtils.readLines(file, "Cp1252");
Java includes several classes for reading files; see http://docs.oracle.com/javase/tutorial/essential/io/index.html for more.
I hope this helps if all you need is to have the content of your text file available in memory.
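If you would rather not add a dependency, the same two operations can be sketched with the standard java.nio.file.Files class. This is only a minimal sketch: the class name PlainTextLoader and its method names are made up for illustration, and it assumes the files really are plain text in a known encoding.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper class; the name is not from any library.
public class PlainTextLoader {

    // Reads the whole file into one String using the given encoding.
    public static String readAll(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    // Reads the file as a list of lines using the given encoding.
    public static List<String> readLines(Path file, Charset charset) throws IOException {
        return Files.readAllLines(file, charset);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]); // e.g. a .cmp or .sbs file
        for (String line : readLines(file, StandardCharsets.UTF_8)) {
            System.out.println(line); // do something with each line
        }
    }
}
```

Both methods load the entire file into memory, so this only suits files that comfortably fit there.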

Related

Kaitai code writing

I recently started using kaitai-struct to deal with arbitrary binary formats. I have created the .ksy file for my data and compiled it to my target language, Java. Can anyone point me to how to pass in the input file that holds the data, and how to get the parsed data as output so that I can write code to manipulate it to my requirements? Is there a tutorial on how to write code that works with the parsed data?
Thanks in advance.

First you have to generate Java classes from the .ksy file using the Kaitai Struct Compiler or the WebIDE. You can find more information on how to use the compiler in the Kaitai user guide.
If you use the WebIDE, simply right-click on your .ksy file and select the Generate parser > Java menu item.
After you have the generated Java code, you can parse a structure directly from a local file like this:
AnExampleClass output = AnExampleClass.fromFile("an_example.data");
// ... manipulate output ...
Or you can parse a structure from a byte array (byte[]):
AnExampleClass output = new AnExampleClass(new KaitaiStream(byteArray));
// ... manipulate output ...
Note that parsing from non-seekable streams (e.g. FileInputStream, BufferedInputStream, etc.) is not supported and probably won’t be supported, as a lot of parsing functionality in KS relies on seek support.
You can read the generic documentation on how to use the API here and you can find the Java-specific documentation here.

The answer from koczkatamas is outdated.
There are now specific implementations.
The snippet would be
AnExampleClass output = new AnExampleClass(new ByteBufferKaitaiStream(byteArray));
See this issue for more details

RandomAccessFile equivalent for directory/folder

I have searched the web and StackOverflow but did not find anything that answers this specifically. I found answers saying that one cannot use RandomAccessFile for reading a folder. OK, but is there something similar?
I have a piece of code that originally read from a special type of file using RandomAccessFile. I have since changed a lot of the content of that file by extracting it to a folder, but there is no way to pack it back into the original format (there is a technique for extracting it but none for repacking it), so I now need to read it as a directory instead of a file. My problem is that a huge number of classes are built around the RandomAccessFile object, so changing them all just because of this would be a "no-go".
Therefore my question is: is there any equivalent that would return something like a RandomAccessFile, but for a directory/folder?
Specifically, I would only need to change the file variable in this part of the code:
RandomAccessFile file = new RandomAccessFile(lifFile, "r");
long positionOffset = 0;
LIFFile rootFile = parseLIFFile(file, positionOffset);
LIFReader reader = new LIFReader(file, lifFile, rootFile, positionOffset);
return reader;
Can anyone help me do something like that?
EDIT:
Just a clarification: that special "original" file actually holds a number of other files, such as images, 3D geometries, and XML files, in separate directories.
EDIT 2: I have already solved it by creating a completely new constructor (derived from the original code) as suggested below. Thanks for your thoughts, guys.

Taking an ArrayList and putting it into a text file

I'm having some issues with a program I am trying to write. Basically, I need to take an ArrayList that I have and export its information to a text file. I have tried several different solutions from Google, but none have worked.
I have two ArrayLists, one of graduates and one of undergraduates. I simply need to take the information held in these ArrayLists and put it into one text file.
I'll later need to do the opposite (import the .txt file into ArrayLists), but I can figure that out later.
Any suggestions?

If you need to write the data in a specific format, you could use a PrintWriter to write the data to a file in whatever manner you wish. The problem with this is that you will then have to figure out a way in which you will then re-read the text file and populate the data.
On the other hand, you could use XStream (tutorial here) to write your files as XML. This will give you a human-readable text file (as above); however, it will be much easier to re-read the text file when populating the data.
Lastly, you could use the ObjectOutputStream to write the data and the ObjectInputStream to re-read it back. Note however, that this method does not yield a human readable text file. Also, your classes will need to implement the Serializable interface.
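As a rough illustration of that last option, here is a minimal sketch of a round trip with ObjectOutputStream and ObjectInputStream. It assumes the two lists hold Strings (which are already Serializable); the class name StudentListStore and its methods are invented for the example, and your own element classes would need to implement Serializable.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
public class StudentListStore {

    // Writes both lists to one (binary, not human-readable) file.
    public static void save(File file, List<String> grads, List<String> undergrads)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<>(grads));      // first object: graduates
            out.writeObject(new ArrayList<>(undergrads)); // second object: undergraduates
        }
    }

    // Reads the two lists back in the order they were written:
    // index 0 is the graduates, index 1 the undergraduates.
    @SuppressWarnings("unchecked")
    public static List<List<String>> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            List<List<String>> both = new ArrayList<>();
            both.add((List<String>) in.readObject());
            both.add((List<String>) in.readObject());
            return both;
        }
    }
}
```

The lists are copied into ArrayLists before writing so that the serialized type is predictable regardless of which List implementation the caller passes in.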

Here's a solution using Apache commons-io library:
//Put all data into one big list, prepended with size of first list
List<String> allData = new ArrayList<String>(1+grads.size()+undergrads.size());
allData.add(String.valueOf(grads.size()));
allData.addAll(grads);
allData.addAll(undergrads);
FileUtils.writeLines(new File("list.txt"), allData);
To read the data back:
List<String> allData = FileUtils.readLines(new File("list.txt"));
int gradsSize = Integer.parseInt(allData.get(0));
List<String> grads = allData.subList(1, gradsSize+1);
List<String> undergrads = allData.subList(1+gradsSize, allData.size());

Read and write a file in reverse order - Java

I have a very big file (possibly as large as 1 GB) and I want to create a new file from it with the lines in reverse order (in Java).
For example:
Original file:
This is the first line
This is the 2nd line
This is the 3rd line
The reversed file:
This is the 3rd line
This is the 2nd line
This is the first line
Since the file is very big, loading the entire file to memory at once and reversing the order there might be problematic (there is a limit to the memory I can use).
How can I achieve this in Java?
Thanks

Nothing very direct, I'm afraid. But you can easily create some (say) ReverseBufferedRead class wrapping a RandomAccessFile.
See also here.

Read the file in chunks of a few hundred lines, reverse the order of the lines in each chunk, and write each chunk to a temporary file. Then join the temporary files in reverse order and clean up.
In other words, use disk instead of memory.
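A minimal sketch of this temporary-file approach, assuming a UTF-8 text file, could look like the following. The class name ChunkedLineReverser and the linesPerChunk parameter are illustrative only, and the output uses the platform line separator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class for illustration only.
public class ChunkedLineReverser {

    // Writes the lines of 'in' to 'out' in reverse order, holding at most
    // linesPerChunk lines in memory at any time.
    public static void reverse(Path in, Path out, int linesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == linesPerChunk) {
                    chunks.add(writeReversed(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeReversed(buffer)); // last, possibly shorter, chunk
            }
        }
        // Join the temporary files, last chunk first, deleting each one after use.
        try (OutputStream os = Files.newOutputStream(out)) {
            for (int i = chunks.size() - 1; i >= 0; i--) {
                Files.copy(chunks.get(i), os);
                Files.delete(chunks.get(i));
            }
        }
    }

    // Reverses the buffered lines in place and writes them to a new temporary file.
    private static Path writeReversed(List<String> lines) throws IOException {
        Collections.reverse(lines);
        Path tmp = Files.createTempFile("chunk", ".tmp");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        return tmp;
    }
}
```

A call such as ChunkedLineReverser.reverse(Paths.get("in.txt"), Paths.get("out.txt"), 500) would use chunks of 500 lines. The sketch does not clean up temporary files if an exception interrupts it.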

I would propose making a RandomAccessFile for the output and using setLength() to make it appropriately sized.
Then start scanning the original file and write it out in chunks starting at the end of the RandomAccessFile in reverse.
Java-ish pseudocode:
RandomAccessFile out = new RandomAccessFile("out_fname", "rw");
out.setLength(size_of_file_to_be_reversed);
out.seek(size_of_file_to_be_reversed); // seek to end
File in = new File("in_fname");
while (hasMoreData(in)) {
String chunk = in.readsize(); // pseudocode: read the next chunk of input
out.seekBackwardsBy(chunk.length());
out.write(chunk.reverse);
out.seekBackwardsBy(chunk.length()); // undo the advance caused by the write
}

Reading a file line-by-line in reverse order is fundamentally tricky.
It's not too bad if you've got a fixed width encoding. It's feasible if you've got a variable width encoding which you can detect the first byte of etc (e.g. UTF-8). It's virtually impossible to do efficiently if the encoding is variable width with no sensible way of determining boundaries (or if it uses "shifting" for example).
I have an implementation in C# in another question, but it would take a fair amount of effort to port that to Java.
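For the UTF-8 case specifically, the boundary detection is simple because the newline byte 0x0A never occurs inside a multi-byte sequence, so the file can be scanned backwards byte by byte. The following is a minimal sketch under that assumption only (UTF-8 input, lines ending in \n or \r\n); it is not a port of the C# implementation, and the class name BackwardLineScanner is made up for illustration.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.RandomAccessFile;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
public class BackwardLineScanner {

    // Copies the lines of a UTF-8 file to 'out', last line first.
    public static void reverseLines(File in, File out) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(in, "r");
             Writer writer = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(out), StandardCharsets.UTF_8))) {
            long lineEnd = raf.length(); // exclusive end of the line being collected
            if (lineEnd == 0) {
                return; // empty input gives empty output
            }
            raf.seek(lineEnd - 1);
            if (raf.read() == '\n') {
                lineEnd--; // ignore the terminator of the final line
            }
            byte[] block = new byte[8192];
            long blockEnd = lineEnd; // exclusive end of the region still to scan
            while (blockEnd > 0) {
                int n = (int) Math.min(block.length, blockEnd);
                long blockStart = blockEnd - n;
                raf.seek(blockStart);
                raf.readFully(block, 0, n);
                for (int i = n - 1; i >= 0; i--) {
                    if (block[i] == '\n') { // safe in UTF-8: 0x0A is never a continuation byte
                        long pos = blockStart + i;
                        writeLine(raf, pos + 1, lineEnd, writer);
                        lineEnd = pos;
                    }
                }
                blockEnd = blockStart;
            }
            writeLine(raf, 0, lineEnd, writer); // the first line of the input
        }
    }

    // Decodes bytes [start, end) as one line, drops a trailing '\r', and writes it.
    private static void writeLine(RandomAccessFile raf, long start, long end, Writer w)
            throws IOException {
        byte[] bytes = new byte[(int) (end - start)];
        raf.seek(start);
        raf.readFully(bytes);
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--;
        }
        w.write(new String(bytes, 0, len, StandardCharsets.UTF_8));
        w.write('\n');
    }
}
```

Memory use is bounded by the block size plus the longest single line. Encodings where a newline byte can appear inside another character would need a different boundary test.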

If you use the RandomAccessFile like leonbloy suggested you can use a FileChannel
to skip to the end of the file, you can then read the line and write it to another file.
There is a simple example here in the Java tutorials: example

I assume you know how to read a file. One way I would advise doing it is with an ArrayList of Strings: read each line of the file and store it in that list. After reading, you can print the list out or do whatever you want with it.
I just wrote something that might be of help here: http://pastebin.com/iWTVrAvm

Read using RandomAccessFile (position the file pointer using randomAccessFile.length()) and write using BufferedWriter.

A better solution is to use the ReversedLinesFileReader provided in the Apache Commons IO package. Look at the API here: https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/ReversedLinesFileReader.html

Rename Pdf from Pdf title

I want to organize the PDF files I have downloaded from the internet. Many of them are clearly ill-named, and I want to extract the real title from each file. Many of them were generated from LaTeX, and I think we may be able to find the \title{} keyword or something like it in the compiled PDF. I then want to use this to rename the file.
I can read the metadata using pypdf, but most PDFs do not contain the title in their metadata. I have tried it on my whole collection and found none!
Two questions:
1. Is it possible to read the title from a PDF that was compiled from LaTeX?
2. Which library (mainly in C/C++, Java, or Python) can I use to get that information?
Thanks in advance.

I think this is not really possible. The LaTeX information is no longer present in the pdf. If the title is not present in the metadata, you might be able to deduce the title from the structure information if it is a "tagged pdf". Most pdfs aren't however, and those that are will probably provide the metadata anyway.
This leaves you with layout analysis: try to determine what is the title from the document by looking at layout characteristics. For python, you might want to have a look at pdfminer.
The following example uses pdfminer to determine the title using a rather simplistic approach:
we assume that the title is somewhere on the first page
we leave it to pdfminer to recognize "blocks of text" on the first page
we assume that the title is printed "bigger" than the rest of the page. Looking at the height of each line in the text blocks, we determine which block contains the "tallest" line, and assume that that block contains the title
we let pdfminer extract the text from the block,
the text will probably contain newlines (placed by pdfminer) because the title might contain more than one line, and other needless whitespace, so we do some simple whitespace normalization (replace consecutive whitespace by a single space, and strip leading and trailing whitespace), and that's it!
As I said: this approach is rather simplistic, and might or might not give good results for your documents, but it may point you in the right direction. Here it goes:
import sys
import re
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
filename = sys.argv[1]
fp = open(filename, 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interp = PDFPageInterpreter(rsrcmgr, device)
pages = doc.get_pages()
first_page = pages.next()
interp.process_page(first_page)
layout = device.get_result()
textboxes = [i for i in layout if isinstance(i, LTTextBox)]
box_with_tallest_line = max(textboxes, key=lambda x: max(i.height for i in x))
text = box_with_tallest_line.get_text()
print re.sub('\s+', ' ', text).strip()
I'll leave renaming the file to you (note that the title might contain characters that you might not want, or that are not even valid in filenames). Pdfminer documentation is rather sparse at the moment, so you might want to ask on the mailing list if you need to know more. (don't know very much about it myself, but couldn't resist trying ;-)). Or you might try a similar approach with other pdf libraries/other languages.
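For the renaming step, one possible sketch in Java (one of the languages the question allows) is shown below. It assumes the title has already been extracted as a String; the class name TitleRenamer, the 120-character limit, and the "untitled" fallback are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration only.
public class TitleRenamer {

    // Turns an extracted title into a conservative file name (without extension).
    public static String toFileName(String title) {
        String name = title
                // replace characters that are invalid on Windows, plus control characters
                .replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", " ")
                // collapse runs of whitespace into a single space
                .replaceAll("\\s+", " ")
                .trim();
        if (name.length() > 120) {
            name = name.substring(0, 120).trim(); // arbitrary length limit
        }
        return name.isEmpty() ? "untitled" : name;
    }

    // Renames the PDF in place, keeping it in the same directory.
    // Throws FileAlreadyExistsException instead of overwriting another file.
    public static Path renameToTitle(Path pdf, String title) throws IOException {
        Path target = pdf.resolveSibling(toFileName(title) + ".pdf");
        return Files.move(pdf, target);
    }
}
```

Because Files.move is called without REPLACE_EXISTING, two documents that yield the same title will not silently overwrite each other; the caller has to decide how to handle the collision.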

In python, your best bet is to look at pyPdf (Debian package: python-pypdf). Here's some code:
import pyPdf, sys
filename=sys.argv[1]
i=pyPdf.PdfFileReader(open(filename,"rb"))
d=i.getDocumentInfo()
print d["/Title"]
In my experience, few PDFs have the "/Title" attribute set, though, so your mileage may vary. In that case, you'll have to guess the title from the contents, which is bound to be error-prone. pyPdf may help you with that as well.

Try iText (Java). I found this example, try it (you may add generics, if supported):
PdfReader reader = new PdfReader("yourpdf.pdf");
HashMap map= reader.getInfo();
Set keys = map.keySet();
Iterator i = keys.iterator();
while(i.hasNext()) {
String thiskey = (String)i.next();
System.out.println(thiskey + ":" + (String)map.get(thiskey));
}

Another option for C++ is Poppler.
I tried to do something similar in the past (and was asking advice here:
Extracting text from PDF with Poppler (C++) ) but never really got it working. At the end of the day I realised that at least for my use, it was easier to manually rename the files.

The best solution I found for renaming PDF files using not just the title but any text you need from the PDF file is the A-PDF Rename app; it worked very well for all the files I tried.
