I am using openPdf library (fork of iTextPdf) to replace placeholders like #{Address_name_1} with real values. My PDF file is not simple, so I use regular expression to find this placeholder:[{].*?[A].*?[d].*?[d].*?[r].*?[e].*?[s].*?[s].*?[L].*?[i].*?[n].*?[e].*?[1].*?[}]
and do something like
content = MY_REGEXP.replace(content, "Saint-P, Nevskiy pr.");
obj.setData(content.toByteArray(CHARSET)).
The problem happens when my replacement line is too long and it is unfortunately cut from right end. Can I somehow make it carry over to the next line? Naive \n does not work.
PDF store strings in a different way. There are no next lines, there are lines.
So you will need to add several placeholders on fields on your template for replacements that can get long enough, like:
#{Address_name_1_line1}
#{Address_name_1_line2}
#{Address_name_1_line3}
And place it in different lines on your template. The non-used empty placeholders (because replacement is not long enough) should be replaced by empty strings.
For longer replacements you will need to use several placeholders. The number of placeholders to use and the replacement splitting should be determined by code.
If your PDF is too complex to place different placeholders then you will need to placeholder everything, all your text contents should be inyected into placeholders, at least if you want to use this approach.
PDF files are NOT text files. Each line is an object with an x/y offset. To place something on the next line would require a new object placed at new x/y coords. You would need an advanced PDF editing toolkit.
Related
I have a huge xml file from wiktionary that I need to parse for a class project. I only need to extract data from a set of 200 lines, which start at line 395,000. How would I go about only scanning that small number of lines? Is there some sort of built in property for line number?
If line boundaries are significant in your data then it's not true XML. Accept it for what it is, a line-oriented file, and start by processing it using line-oriented text tools. Use these to extract the XML (if you can), and then pass this XML to an XML parser.
There is no built in property for line numbers.
If you want to look at all of the data from line 395,000 to 395,200 programatically, you can do so by counting new line characters.
Each line in the file ends with a new line ("\n"), so you could count 349,999 of them, and then look at the data until you see 200 more.
I have a program which will be used for building questions database. I'm making it for a site that want user to know that contet was donwloaded from that site. That's why I want the output be PDF - almost everyone can view it, almost nobody can edit it (and remove e.g. footer or watermark, unlike in some simpler file types). That explains why it HAS to be PDF.
This program will be used by numerous users which will create new databases or expand existing ones. That's why having output formed as multple files is extremly sloppy and inefficient way of achieving what I want to achieve (it would complicate things for the user).
And what I want to do is to create PDF files which are still editable with my program once created.
I want to achieve this by implementing my custom file type readable with my program into the output PDF.
I came up with three ways of doing that:
Attach the file to PDF and then corrupting the part of PDF which contains it in a way it just makes the PDF unaware that it contains the file, thus making imposible for user to notice it (easely). Upon reading the document I'd revert the corruption and extract file using one of may PDF libraries.
Hide the file inside an image which would be added to the PDF somwhere on the first or last page, somehow (that is still need to work out) hidden from the public eye. Knowing it's location, it should be relativley easy to retrieve it using PDF library.
I have learned that if you add "%" sign as a first character in line inside a PDF, the whole line will be ignored (similar to "//" in Java) by the PDF reader (atleast Adobe reader), making possible for me to add as many lines as I want to the PDF (if I know where, and I do) whitout the end user being aware of that. I could implement my whole custom file into PDF that way. The problem here is that I actually have to read the PDF using one of the Java's input readers, but I'm not sure which one. I understand that PDF can't be read like a text file since it's a binary file (Right?).
In the end, I decided to go with the method number 3.
Unless someone has any better ideas, and the conditions are:
1. One file only. And that file is PDF.
2. User must not be aware of the addition.
The problem is that I don't know how to read the PDF as a file (I'm not trying to read it as a PDF, which I would do using a PDF library).
So, does anyone have a better idea?
If not, how do I read PDF as a FILE, so the output is array of characters (with newline detection), and then rewrite the whole file with my content addition?
In Java, there is no real difference between text and binary files, you can read them both as an inputstream. The difference is that for binary files, you can't really create a Reader for it, because that assumes there's a way to convert the byte stream to unicode characters, and that won't work for PDF files.
So in your case, you'd need to read the files in byte buffers and possibly loop over them to scan for bytes representing the '%' and end-of-line character in PDF.
A better way is to use another existing way of encoding data in a PDF: XMP tags. This is allows any sort of complex Key-Value pairs to be encoded in XML and embedded in PDF's, JPEGs etc. See http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf.
There's an open source library in Java that allows you to manipulate that: http://pdfbox.apache.org/userguide/metadata.html. See also a related question from another guy who succeeded in it: custom schema to XMP metadata or http://plindenbaum.blogspot.co.uk/2010/07/pdfbox-insertextract-metadata-frominto.html
It's all just 1's and 0's - just use RandomAccessFile and start reading. The PDF specification defines what a valid newline character(s) is/are (there are several). Grab a hex editor and open a PDF and you can at least start getting a feel for things. Be careful of where you insert your lines though - you'll need to add them towards the end of the file where they won't screw up the xref table offsets to the obj entries.
Here's a related question that may be of interest: PDF parsing file trailer
I would suggest putting your comment immediately before the startxref line. If you put it anywhere else, you could wind up shifting things around and breaking the xref table pointers.
So a simple algorithm for inserting your special comment will be:
Go to the end of the file
Search backwards for startxref
Insert your special comment immediately before startxref - be sure to insert a newline character at the end of your special comment
Save the PDF
You can (and should) do this manually in a hex editor.
Really important: are your users going to be saving changes to these files? i.e. if they fill in the form field, are they going to hit save? If they are, your comment lines may be removed during the save (and different versions of different PDF viewers could behave differently in this regard).
XMP tags are the correct way to do what you are trying to do - you can embed entire XML segments, and I think you'd be hard pressed to come up with a data structure that couldn't be expressed as XML.
I personally recommend using iText for this, but I'm biased (I'm one of the devs). The iText In Action book has an excellent chapter on embedding XMP data into PDFs. Here's some sample code from the book (which I definitely recommend): http://itextpdf.com/examples/iia.php?id=217
Thanks for reading :)
I'm trying to search for words like "UTTD_Equip_City_TE" across RTF documents using Lucene. This word appears in two different forms:
«UTTD_Equip_City_TE»,
«UTTD_Equip_City_TE»
I first tried with StandardAnalyzer, but it seems to break down the word into "UTTD", "Equip", "City", and "TE".
Then I tried again using WhiteSpaceAnalyzer, but it doesn't seem to be working... (I don't know why).
Could you help me I should approach this problem? By the way, editing the Lucene source and recompiling it with Ant is not an option :(
Thanks.
EDIT: there are other texts in this document, too. For example:
SHIP TO LESSEE (EQUIPMENT location address): «UTTD_Equip_StreetAddress_TE», «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»
Basically, I'm trying to index RTF files, and inside each RTF file is tables with variables. Variables are wrapped with « and » . I'm trying to search those variables in the documents. I've tried searching "«" + string + "»", but it hasn't worked...
This example could give a better picture: http://i.imgur.com/SwlO1.png
Please help.
KeywordAnalyzer tokenizes the entire field as a single string. It sounds like this might be what you're looking for, if the substrings are in different fields within your document.
See: KeywordAnalyzer
Instead, if you are adding the entire content of the document within a single field, and you want to search for a substring with embedded '_' characters within it, then I would think that WhitespaceAnalyzer would work. You stated that it didn't work, though. Can you tell us what the results were when you tried using WhitespaceAnalyzer? And did you use it for both Indexing and Querying?
I see two options here. In both cases you have to build a custom analyzer.
Option 1
Start with StandardTokenizer's grammar file and customize it so that it emits text separated by '_' as a single token. (refer to Generating a custom Tokenizer for new TokenStream API using JFlex/ Java CC). Build your Analyzer using this new Tokenizer along with LowerCaseFilter.
Oprion 2
Write a Custom Analyzer that is made of WhiteSpaceTokenizer and custom TokenFilters. In these TokenFilters you decide on how to act on the tokens returned by WhiteSpaceTokenizer.
Refer to http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/analysis/package-summary.html for more details on analysis
I have a java string like the one below which has multiple lines and blank spaces. Need to remove all of them such that these are one line.
These are xml tags and the editor is not allowing to include less than symbol
<paymentAction>
Authorization
</paymentAction>
Should become
<paymentAction>AUTHORIZATION</paymentAction>
Thanks in advance
Calling theString.replaceAll("\\s+","") will replace all whitespace sequences with the empty string. Just be sure that the text between the tags doesn't contain spaces too, othewerise they'll get removed too.
You essentially want to convert the XML you have to Canonical Form. Below is one way of doing it but it requires you to use that library. If you doesn't want to depend upon external libraries then another option for you is to use XSLT.
The Canonicalizer class at Apache XML Security project:
NOTE: Dealing with non-xml aware API's (String.replaceAll()) is not generally recommended as you end up dealing with special/exception cases.
This is a start. Probably not enough, but should be in the right direction.
xml.replaceAll(">\\s*", ">").replaceAll("\\s*<, "<");
However, I'm tempted to say there has to be a way to create a document from the XML and then serialize it in canonical form as Pangea suggested.
So say you have a file that is written in XML or soe other coding language of that sort. Is it possible to just rewrite one line rather than getting the entire file into a string, then changing then line, then having to rewrite the whole string back to the file?
In general, no. File systems don't usually support the idea of inserting or modifying data in the middle of a file.
If your data file is in a fixed-size record format then you can edit a record without overwriting the rest of the file. For something like XML, you could in theory overwrite one value with a shorter one by inserting semantically-irrelevant whitespace, but you wouldn't be able to write a larger value.
In most cases it's simpler to just rewrite the whole file - either by reading and writing in a streaming fashion if you can (and if the file is too large to read into memory in one go) or just by loading the whole file into some in-memory data structure (e.g. XDocument), making the changes, and then saving the file again. (You may want to consider saving to a different file then moving the files around to avoid losing data if the save operation fails for some reason.)
If all of this ends up being too expensive, you should consider using multiple files (so each one is smaller) or a database.
If the line you want to replace is larger than the new line that you want to replace it with, then it is possible as long as it is acceptable to have some kind of padding (for example white-space characters ' ') which will not effect your application.
If on the other hand the new content are larger than the content to be replaced you will need to shift all the data downwards, so you need to rewrite the file, or at least from the replaced line onwards.
Since you mention XML, it might be you are approaching your problem in the wrong way. Could it be that what you need is to replace a specific XML node? In which case you might consider using DOM to read the XML into a hierarchy of nodes and adding/updating/removing in there before writing the XML tree back to the file.