I want to organize my PDF files downloaded from the internet. Many of them are ill-named, so I want to extract the real title from each file and use it to rename the file. Many of them are generated from LaTeX, and I think that from the compiled PDF we can recover the \title{} content or something like that.
I can read the metadata using pypdf, but most PDFs do not contain the title in their metadata. I have tried it with my whole collection and found none!
Two questions:
1. Is it possible to read the title from a PDF compiled from LaTeX?
2. Which library (mainly in C/C++, Java, or Python) can I use to get that information?
Thanks in advance.
I think this is not really possible: the LaTeX source information is no longer present in the PDF. If the title is not present in the metadata, you might be able to deduce it from the structure information if it is a "tagged PDF". Most PDFs aren't, however, and those that are will probably provide the metadata anyway.
This leaves you with layout analysis: try to determine the title by looking at the document's layout characteristics. For Python, you might want to have a look at pdfminer.
The following example uses pdfminer to determine the title using a rather simplistic approach:
- we assume that the title is somewhere on the first page;
- we leave it to pdfminer to recognize "blocks of text" on the first page;
- we assume that the title is printed "bigger" than the rest of the page: looking at the height of each line in the text blocks, we determine which block contains the "tallest" line, and assume that block contains the title;
- we let pdfminer extract the text from that block;
- the text will probably contain newlines (placed by pdfminer), because the title might span more than one line, as well as other needless whitespace, so we do some simple whitespace normalization (replace consecutive whitespace with a single space and strip leading and trailing whitespace), and that's it!
As I said: this approach is rather simplistic, and might or might not give good results for your documents, but it may point you in the right direction. Here it goes:
import sys
import re

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

# open the document (old pdfminer API, Python 2)
filename = sys.argv[1]
fp = open(filename, 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()

# set up layout analysis and process only the first page
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interp = PDFPageInterpreter(rsrcmgr, device)
pages = doc.get_pages()
first_page = pages.next()
interp.process_page(first_page)
layout = device.get_result()

# the text box containing the tallest line is assumed to hold the title
textboxes = [i for i in layout if isinstance(i, LTTextBox)]
box_with_tallest_line = max(textboxes, key=lambda box: max(line.height for line in box))

# normalize whitespace: collapse runs (including newlines) into single spaces
text = box_with_tallest_line.get_text()
print(re.sub(r'\s+', ' ', text).strip())
I'll leave renaming the file to you (note that the title might contain characters that you might not want, or that are not even valid in filenames). Pdfminer documentation is rather sparse at the moment, so you might want to ask on the mailing list if you need to know more (I don't know very much about it myself, but couldn't resist trying ;-)). Or you might try a similar approach with other PDF libraries or other languages.
In Python, your best bet is to look at pyPdf (Debian package: python-pypdf). Here's some code:
import sys

import pyPdf

# print the /Title entry of the document information dictionary, if any
filename = sys.argv[1]
reader = pyPdf.PdfFileReader(open(filename, "rb"))
info = reader.getDocumentInfo()
print(info["/Title"])
In my experience, few PDFs have the "/Title" attribute set, though, so your mileage may vary. In that case, you'll have to guess the title from the contents, which is bound to be error-prone. pyPdf may help you with that as well.
Try iText (Java). I found this example; try it (generics added, assuming an iText version that supports them):
PdfReader reader = new PdfReader("yourpdf.pdf");
// the info dictionary holds the document metadata (title, author, ...)
HashMap<String, String> map = reader.getInfo();
for (Map.Entry<String, String> entry : map.entrySet()) {
    System.out.println(entry.getKey() + ":" + entry.getValue());
}
Another option for C++ is Poppler.
I tried to do something similar in the past (and asked for advice here: Extracting text from PDF with Poppler (C++)), but never really got it working. At the end of the day I realised that, at least for my use, it was easier to rename the files manually.
The best solution I found for renaming PDF files using not just the title, but any text you need from the PDF file, is the A-PDF Rename app; it worked very well for all the files I tried.
Related
I need to modify a file. We've already written a reasonably complex component to build sets of indexes describing where interesting things are in this file, but now I need to edit this file using that set of indexes and that's proving difficult.
Specifically, my dream API is something like this:
//if you'll let me use kotlin for a second, assume we have a simple tuple class
data class IdentifiedCharacterSubsequence(val indexOfFirstChar: Int, val existingContent: String)
//given these two structures
List<IdentifiedCharacterSubsequence> interestingSpotsInFile = scanFileAsPerExistingBusinessLogic(file, businessObjects);
Map<IdentifiedCharacterSubsequence, String> newContentByPreviousContentsLocation = generateNewValues(interestingSpotsInFile, moreBusinessObjects);
//I want something like this:
try(MutableFile mutableFile = new com.maybeGoogle.orApache.MutableFile(file)){
for(IdentifiedCharacterSubsequence seqToReplace : interestingSpotsInFile){
String newContent = newContentByPreviousContentsLocation.get(seqToReplace);
mutableFile.replace(seqToReplace.indexOfFirstChar, seqToReplace.existingContent.length(), newContent);
//very similar to the StringBuilder interface
//'enqueues' data changes in memory, doesn't actually modify the file until flush is called...
}
mutableFile.flush();
// ...at which point a single write-pass is made.
// assumption: changes will change many small regions of text (instead of large portions of text)
// -> buffering makes sense
}
Some notes:
I can't use RandomAccessFile because my changes are not in-place (newContent may be longer or shorter than seqToReplace.existingContent).
The files are often many megabytes in size, so simply reading the whole thing into memory and modifying it as an array is not appropriate.
Does something like this exist, or am I reduced to writing my own implementation using BufferedWriters and the like? It seems like such an obvious evolution from io streams for a language that typically emphasizes index-based behaviour, but I can't find an existing implementation.
Lastly: I have very little domain experience with files and encoding schemes, so I have made no effort to address the 'two-index' characters described in questions like this one: Java charAt used with characters that have two code units. Any help on this front is much appreciated. Is this perhaps why I'm having trouble finding an implementation like this, because indexes into UTF-8-encoded files are so pesky and bug-prone?
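As far as I know there is no ready-made MutableFile in the JDK or Apache Commons, but the buffered single-write-pass described above can be sketched by streaming the source to a temporary file and moving it back. A minimal sketch, assuming replacements are non-overlapping and their indexes are valid char offsets (the Replacement class is a hypothetical stand-in for IdentifiedCharacterSubsequence, and it sidesteps the surrogate-pair issue by treating indexes as char offsets):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;

// hypothetical tuple mirroring IdentifiedCharacterSubsequence from the question
final class Replacement {
    final int indexOfFirstChar;
    final String existingContent;
    final String newContent;

    Replacement(int indexOfFirstChar, String existingContent, String newContent) {
        this.indexOfFirstChar = indexOfFirstChar;
        this.existingContent = existingContent;
        this.newContent = newContent;
    }
}

public final class SinglePassRewrite {
    // streams the source file to a temp file, substituting each replacement,
    // then moves the temp file over the original in one shot
    static void rewrite(Path source, List<Replacement> replacements) throws IOException {
        replacements.sort(Comparator.comparingInt(r -> r.indexOfFirstChar));
        Path temp = Files.createTempFile("rewrite", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(source);
             BufferedWriter out = Files.newBufferedWriter(temp)) {
            int pos = 0; // current char index in the source
            for (Replacement r : replacements) {
                // copy unchanged characters up to the start of this replacement
                while (pos < r.indexOfFirstChar) {
                    out.write(in.read());
                    pos++;
                }
                // skip the old content and emit the new content instead
                pos += (int) in.skip(r.existingContent.length());
                out.write(r.newContent);
            }
            // copy the remainder of the file untouched
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        }
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }
}

The buffering the question asks for falls out of BufferedReader/BufferedWriter here; the trade-off is one full read and one full write per flush, which matches the "many small regions" assumption.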
I'm writing a parser for files that look like this:
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
I want to get the information preceded by certain tags (DEFINITION, VERSION, etc.), but some descriptions cover multiple lines and I need all of it. This is a problem when using BufferedReader to read my file.
I almost figured it out by using mark() and reset() but when executing my program I noticed that it only works for one tag and other tags are somehow skipped. This is the code I have so far:
Pattern pTag = Pattern.compile("^[A-Z]{2,}"); // regex: 2 or more uppercase letters is a tag
Matcher mTagCurr = pTag.matcher(line);
if (mTagCurr.find()) {
    reader.mark(1000);
    String nextLine = reader.readLine();
    Matcher mTagNext = pTag.matcher(nextLine);
    if (mTagNext.find()) {
        reader.reset();
        continue;
    }
    Pattern pWhite = Pattern.compile("^\\s{6,}");
    Matcher mWhite = pWhite.matcher(nextLine);
    while (mWhite.find()) {
        line = line.concat(nextLine);
    }
    System.out.println(line);
}
This piece of code is supposed to find tags and concatenate descriptions that cover more than one line. Some answers I found here advised using Scanner; that is not an option for me. The files I work with can be very large (the largest I encountered was >50 GB), and by using BufferedReader I hope to put less strain on my system.
I suggest accumulating the information as you read it in a single-pass parser. I suspect this will be simpler and faster in this case.
BTW, you want to cache your Patterns, as creating them is quite expensive. You may find that you want to avoid using them entirely in some cases.
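Here is a minimal sketch of what I mean by single-pass accumulation (the tag regex is the one from your question; returning a tag-to-text map is my assumption about the output you want, and a real GenBank parser would need to handle indented sub-keywords too):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenbankTags {
    // compiled once and reused; recompiling per line is the expensive part
    private static final Pattern TAG = Pattern.compile("^[A-Z]{2,}");

    public static Map<String, String> parse(Reader source) throws IOException {
        Map<String, String> entries = new LinkedHashMap<String, String>();
        String currentTag = null;
        StringBuilder value = new StringBuilder();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = TAG.matcher(line);
            if (m.find()) {
                if (currentTag != null) {
                    entries.put(currentTag, value.toString().trim()); // finish previous tag
                }
                currentTag = m.group();
                value.setLength(0);
                value.append(line.substring(m.end()));
            } else if (currentTag != null) {
                value.append(' ').append(line.trim()); // continuation line: just accumulate
            }
        }
        if (currentTag != null) {
            entries.put(currentTag, value.toString().trim());
        }
        return entries;
    }
}

Because each line is consumed exactly once, there is no need for mark()/reset() at all, which is where your current version loses lines.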
The code starts by finding a continuation line and calling reset() if it does not find it, but the code that reads additional lines does not seem to do that. Could it be reading the start of another section in the Genbank file and not putting it back? I don't see all the loop control code here, but what I do see appears to be correct.
If all else fails and you need something easy, there's always BioJava (see How to Read a Genbank File with Biojava3) and see if it helps. I have tried to use BioJava for my own projects, but it always falls a little short.
When I have written FASTA and FASTQ parsers, I read into a byte or char buffer and process it that way, but there is more buffer-management code to write. That way, I don't have to worry about putting bytes back in a buffer. This can also avoid regex, which can be expensive in a time-critical application. Of course, this takes more time to implement.
Tip: For the fastest implementation, if you are managing the buffer yourself, check out NIO (Java NIO Tutorial). I have seen it give up to a 10x speedup in some cases (writing data). The only drawback is that I have not found an easy way to read gzipped sequence data with NIO yet.
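To make the NIO suggestion concrete, here is a small self-contained sketch of that buffer-management style (a made-up example, not taken from my parsers): it scans a file in 64 KiB chunks and counts newlines at the byte level, with no regex involved.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioLineCount {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 16); // 64 KiB chunks
            long lines = 0;
            while (channel.read(buffer) != -1) {
                buffer.flip(); // switch the buffer from filling to draining
                while (buffer.hasRemaining()) {
                    if (buffer.get() == (byte) '\n') {
                        lines++;
                    }
                }
                buffer.clear(); // ready for the next read
            }
            System.out.println(lines + " lines");
        }
    }
}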
I am trying to classify a CSV file using Mahout. My understanding is that I first need to convert the data in the CSV into vectors that can then be used by one of the Mahout classification algorithms. My CSV file consists of text and word-like values, with multiple classes.
I have searched here and found some vague explanations of how to do this, but couldn't find any examples. Can anyone please provide a simple example of how to accomplish this? Or is there a utility available that does this for you?
I was assuming this would be a very common task, but couldn't really find any clear examples.
Any help will be greatly appreciated.
You have some text and word-like values, so you should probably take inspiration from the 20 newsgroups example. It is a nice example, and you can easily adapt its code to your CSV file.
Here is a working link to the 20 newsgroups example for the latest version of Mahout:
https://github.com/jpatanooga/MahoutExamples/blob/master/src/main/java/com/cloudera/mahout/classification/sgd/TwentyNewsgroups.java
The only adaptation needed is to the countWords method, because of changes to the TokenStream API; here is working code for the latest version of Mahout:
private static void countWords(Analyzer analyzer, Collection<String> words, Reader in) throws IOException {
    // use the provided analyzer to tokenize the input stream
    TokenStream ts = analyzer.tokenStream("text", in);
    ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    // for each word in the stream, minus non-word stuff, add the word to the collection
    while (ts.incrementToken()) {
        String s = ts.getAttribute(CharTermAttribute.class).toString();
        words.add(s);
    }
    ts.end();
    ts.close();
    /*overallCounts.addAll(words);*/
}
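For context, one possible way to call it from the same class (StandardAnalyzer and the Version constant are my assumptions here; use whatever Lucene analyzer and version ship with your Mahout):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collection;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
    Collection<String> words = new ArrayList<String>();
    countWords(analyzer, words, new StringReader("some text from one CSV field"));
    System.out.println(words); // tokens ready to be encoded into Mahout vectors
}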
I hope it helps. I adapted this example to a CSV file myself and it worked.
This is my first post here. I'm excited to finally take part.
I'm working on a project where I'm parsing obscure file types. I need to be able to parse Word (which I've already done), .sbs, .day, .cmp, and more. All of these types can be opened and displayed simply with Notepad.
Since I'm so new to this stuff, is there a way I can use some generic library (or two) to open all of these up? And if so, which library would it be?
What's a best practice in this sort of circumstance?
Thanks!
You could use the Apache Commons IO library. The FileUtils class has several methods that receive the file path and optionally the file encoding.
If you just want to read a text file and save its contents to a String variable:
java.io.File file = new java.io.File("C:\\dir\\file.cmp");
String allWordAndLines = org.apache.commons.io.FileUtils.readFileToString(file);
If you want each line separately, stored in a collection:
java.util.List<String> lines = org.apache.commons.io.FileUtils.readLines(file);
for (String line : lines) {
    // do something with line
}
To specify the encoding, you need to add another parameter:
org.apache.commons.io.FileUtils.readFileToString(file, "UTF-8");
org.apache.commons.io.FileUtils.readLines(file, "Cp1252");
Java includes several classes for reading files; see more at http://docs.oracle.com/javase/tutorial/essential/io/index.html
I hope this helps if you only need your text file's contents available in memory.
I want to add words to an open-source Java word-splitting program for Khmer (a language that does not have spaces between words). The developers have not worked on it in a long time, and I haven't been able to contact them for details (http://sourceforge.net/projects/khmer/files/Khmer%20Word%20Breaking/Khmer%20Word%20Breaking%20program%20V1.0/). Supposedly the word list was created from a Khmer dictionary, and I would like to re-create the file to include more words.
Can anyone identify what format the word dictionary is in (I believe it is some type of Trie)? Here are the first few lines:
0ឳមអគណជយឍឫហកដពទឱលថឦឡញឩខនឧផប។ឋវឭឈឃឥឌឰឪសងចភធឯតឆរ
1ទ
0ក
1
1ីែមគួណជយ៍ៀហកទុលេញ៉ឺនំឹៃូឈឃោាឿសងចិ្ធើតៅរ
1គនសងរ
0ទ
0ា
0យ
0ព
0ន
1
1រ
0ា
0ស
0ី
1
And does anyone know how I would go about making a new one? (I have a large word list, but I am not sure how to get it into this format.)
Thanks!
After a quick look through the code, I have a theory.
Create a SearchTree, which extends TreeItem. For each word in your dictionary, call addWord (inherited from TreeItem). When the iteration is done, call export on the SearchTree, and use the new file as the word input file.
Additionally, there may be an undocumented parameter for khwrdbrk.jar, --create, that will read the words for the new tree from standard input.
Again, just a theory, but let me know what happens if you test it out.
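If you want to try that theory, a rough sketch might look like this (SearchTree, addWord, and export come from the khwrdbrk source, so their exact signatures here are guesses, and the file names are made up):

import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class RebuildKhmerDictionary {
    public static void main(String[] args) throws IOException {
        // read the new word list, one Khmer word per line, UTF-8 encoded
        List<String> words = Files.readAllLines(Paths.get("wordlist.txt"), StandardCharsets.UTF_8);

        SearchTree tree = new SearchTree(); // class from the khwrdbrk source
        for (String word : words) {
            tree.addWord(word.trim()); // addWord is inherited from TreeItem
        }

        // serialize the tree in the program's own dictionary format;
        // export's exact parameter type is a guess based on the theory above
        try (PrintStream out = new PrintStream("newdict.txt", "UTF-8")) {
            tree.export(out);
        }
    }
}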