getting /u0000 while replacing a string in pdf using pdfbox [duplicate] - java

This question already has an answer here:
Apache PDFBox: problems with encoding
(1 answer)
Closed 2 years ago.
I am running into a strange issue. I create a PDF from HTML using wkhtmltopdf, and the resulting PDF looks fine.
But when I try to replace a word in one of its strings using PDFBox, the replacement does not happen.
The reason: the strings I read from the text-showing operators contain null characters (\u0000).
My Code:
protected static void replaceText(String word) throws IOException, COSVisitorException {
PDPage page = page1; // page1 is a field assigned at class level
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List tokens = parser.getTokens();
for(int i = 0; i < tokens.size(); i++){
Object next = tokens.get(i);
if(next instanceof PDFOperator){
PDFOperator operator = (PDFOperator) next;
if (operator.getOperation().equals("Tj")) {
COSString previous = (COSString) tokens.get(i - 1);
String string = previous.getString(); // here I get U+0000 (null) characters
List<String> listOfStrings = Arrays.asList(string.split(" "));
if(listOfStrings.contains(word)) {
string = string.replaceFirst(word, "");
previous.reset();
previous.append(string.getBytes(StandardCharsets.ISO_8859_1));
}
}else if (operator.getOperation().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(i - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();// same here
List<String> listOfStrings = Arrays.asList(string.split(" "));
if(listOfStrings.contains(word)) {
System.out.println(string);
string = string.replaceFirst(word, "");
cosString.reset();
cosString.append(string.getBytes(StandardCharsets.ISO_8859_1));
}
}
}
}
}
}
PDStream updatedStream = new PDStream(document);
OutputStream outputStream = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(outputStream);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
document.save(staticFileName);
}
I am using PDFBox 1.8.6 and cannot upgrade.
I have tested this code on other PDFs (not created by wkhtmltopdf), and it works fine there.

It is completely normal that string operands of text drawing operators like Tj contain bytes with value 0.
Your code only works for special PDFs that use fonts with an ASCII'ish encoding (like WinAnsiEncoding) for the text to replace and that also meet some other preconditions.
A generic solution to remove specific words from a PDF is somewhere between very complicated and not automatically possible.
The string operands of text drawing operators consist of bytes encoded according to the Encoding entry of the current font.
This encoding may resemble something common, something ASCII'ish like WinAnsiEncoding; but it may also be something completely different. Often ad-hoc encodings are used, e.g. if the text on the page shows "Test text", an encoding mapping 0 to 'T', 1 to 'e', 2 to 's', 3 to 't', 4 to ' ', and 5 to 'x' may be used and the string for drawing that text would consist of the bytes 0, 1, 2, 3, 4, 3, 1, 5, and 3.
Thus, in general you need to keep track of the current font and use information from it to decode the string arguments.
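To make the "Test text" example concrete, here is a minimal sketch in plain Java (no PDFBox). The ENCODING table is the hypothetical ad-hoc encoding from the example, not something read from a real font; in a real PDF this mapping would have to come from the current font (its Encoding or ToUnicode information).

```java
import java.nio.charset.StandardCharsets;

final class AdHocEncodingDemo {
    // Hypothetical ad-hoc font encoding from the example: character code -> glyph.
    static final char[] ENCODING = {'T', 'e', 's', 't', ' ', 'x'};

    // Decodes a string operand by looking each byte up in the font's table.
    static String decode(byte[] operand) {
        StringBuilder sb = new StringBuilder();
        for (byte b : operand) {
            sb.append(ENCODING[b & 0xFF]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] operand = {0, 1, 2, 3, 4, 3, 1, 5, 3};

        // Decoding with the font's own table recovers the visible text.
        System.out.println(decode(operand)); // Test text

        // Decoding as ISO-8859-1 only yields control characters, starting with U+0000.
        String naive = new String(operand, StandardCharsets.ISO_8859_1);
        System.out.println(naive.contains("Test")); // false
    }
}
```

This is why a search for the word in the ISO-8859-1 view of the operand finds nothing, and why the first character read is U+0000.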

Related

Replace or remove text from PDF with PDFbox in Java

I'm trying to use PDFBox 2.0 to replace or delete a text pattern (in my case I want to remove every "[QR]" from the whole PDF), but I can't find anything that works for me.
I tried iText as well, with the same result: nothing works.
The "[QR]" strings in my PDF were edited in after the PDF was created; maybe that's why they don't appear as Tj operators?
My main:
replaceText(documentoPDF, "[QR]", "");
My method (I printed the Tj values and my pattern doesn't appear there):
public void replaceText(PDDocument documentoPDF, String searchString, String replacement) throws IOException{
for ( PDPage page : documentoPDF.getPages()){
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<?> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++){
Object next = tokens.get(j);
if (next instanceof Operator){
Operator op = (Operator) next;
String pstring = "";
int prej = 0;
//Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
// Tj takes one operand, the string to display, so let's update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else
if (op.getName().equals("TJ"))
{
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++)
{
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString)
{
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
if (j == prej) {
pstring += string;
} else {
prej = j;
pstring = string;
}
}
}
System.out.println(pstring.trim());
if (searchString.equals(pstring.trim()))
{
COSString cosString2 = (COSString) previous.getObject(0);
cosString2.setValue(replacement.getBytes());
int total = previous.size()-1;
for (int k = total; k > 0; k--) {
previous.remove(k);
}
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(documentoPDF);
OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
out.close();
page.setContents(updatedStream);
}
documentoPDF.save("resources\\resultado\\nuevo.pdf");
}
This is an example of pdf with some [QR] patterns: http://www.mediafire.com/file/9w3kkc4yozwsfms/file
If someone can help, I would appreciate it.
I can upload my entire project if you need it.
Thanks in advance.
As already mentioned in comments, the reason why your code doesn't work is simple: you completely ignore the encoding of the font of that text. In the content stream there actually are [( >) ( 4) ( 5) ( #) ] TJ instructions (the "spaces" before '>', '4', '5', and '#' actually are zero bytes, 0x00). Thus, apparently the encoding is some 16-bit encoding which additionally does not have ASCII naturally embedded.
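To illustrate what those zero bytes mean, here is a small sketch in plain Java that splits such a string operand into 16-bit character codes. It assumes fixed two-byte codes, which is only an assumption for this document; in general the code length is determined by the font's CMap. The resulting numbers are character codes in the font's encoding, not Unicode values, so they still need to be mapped through the font.

```java
final class TwoByteCodesDemo {
    // Splits a string operand into big-endian 16-bit character codes.
    static int[] toCodes(byte[] operand) {
        if (operand.length % 2 != 0) {
            throw new IllegalArgumentException("operand length is not a multiple of 2");
        }
        int[] codes = new int[operand.length / 2];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = ((operand[2 * i] & 0xFF) << 8) | (operand[2 * i + 1] & 0xFF);
        }
        return codes;
    }

    public static void main(String[] args) {
        // The bytes of the four strings in the TJ array, concatenated: 00 3E 00 34 00 35 00 23
        byte[] operand = {0, '>', 0, '4', 0, '5', 0, '#'};
        StringBuilder sb = new StringBuilder();
        for (int code : toCodes(operand)) {
            sb.append(String.format("%04x ", code));
        }
        System.out.println(sb.toString().trim()); // 003e 0034 0035 0023
    }
}
```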
To properly take the font into account one has to keep track of the current font. This means parsing the whole content stream and analyzing text font setting calls, save graphics state calls, and restore graphics state calls. Then you have to retrieve the proper font object from the correct resources.
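The bookkeeping described here can be sketched without any PDF library. The following toy example works on a list of already-tokenized strings (a simplification; a real content stream needs a proper parser and typed operands) and tracks the current font name across Tf, q, and Q. The class and method names are made up for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class FontTrackingDemo {
    // Returns, for every Tj, which font resource name was current when it was executed.
    static List<String> fontsPerTj(List<String> tokens) {
        Deque<String> savedFonts = new ArrayDeque<>(); // graphics state stack (font only)
        String currentFont = "(none)";
        List<String> operands = new ArrayList<>();
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            switch (token) {
                case "q": // save graphics state
                    savedFonts.push(currentFont);
                    operands.clear();
                    break;
                case "Q": // restore graphics state
                    if (!savedFonts.isEmpty()) {
                        currentFont = savedFonts.pop();
                    }
                    operands.clear();
                    break;
                case "Tf": // operands: font resource name, font size
                    if (operands.size() == 2) {
                        currentFont = operands.get(0);
                    }
                    operands.clear();
                    break;
                case "Tj": // operand: string to show
                    if (!operands.isEmpty()) {
                        result.add(currentFont + " draws " + operands.get(0));
                    }
                    operands.clear();
                    break;
                default: // anything else is treated as an operand in this toy model
                    operands.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "/F1", "12", "Tf", "q", "/F2", "9", "Tf", "(a)", "Tj", "Q", "(b)", "Tj");
        System.out.println(fontsPerTj(tokens)); // [/F2 draws (a), /F1 draws (b)]
    }
}
```

The font name found this way is only a resource name; it still has to be looked up in the resources of the page (or of the form XObject being processed) to get the actual font object.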
All this actually is already done by the PDFBox content parsing framework used for e.g. text extraction. Thus, we can create a content stream editor around this framework.
Actually, this also has already been done, see the PdfContentStreamEditor from this answer.
In your document, each text piece to delete is drawn by a single text drawing instruction, and each of these instructions draws nothing but a text piece to remove. Thus, we can simply look at the text the current instruction draws and then decide whether to keep the instruction or not:
PDDocument document = ...;
for (PDPage page : document.getDocumentCatalog().getPages()) {
PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
final StringBuilder recentChars = new StringBuilder();
@Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement)
throws IOException {
String string = font.toUnicode(code);
if (string != null)
recentChars.append(string);
super.showGlyph(textRenderingMatrix, font, code, displacement);
}
@Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
String recentText = recentChars.toString();
recentChars.setLength(0);
String operatorString = operator.getName();
if (TEXT_SHOWING_OPERATORS.contains(operatorString) && "[QR]".equals(recentText))
{
return;
}
super.write(contentStreamWriter, operator, operands);
}
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
editor.processPage(page);
}
document.save("nuevo-noQrText.pdf");
(EditPageContent test testRemoveQrTextNuevo)
Depending on your PDFBox version, the showGlyph method to override may have a fifth parameter; please check the showGlyph signature of your PDFBox copy and adapt the override if this code does not work. Thanks to @DanielNorberg for the hint!
In the result the "[QR]" texts underneath the QR codes have vanished.

How to replace text in a pdf with correct encoding using Itext

I wrote a Java program for translating PDFs, using the Google API for the translation. The translation printed to my Eclipse console is correct, but the newly created PDF is either not translated at all (just copied as is), only partly translated, empty, or sometimes corrupted.
I suppose it has something to do with encodings and font types.
I have already gone through the iText site and all the related questions, but nothing worked for my case. I am trying to translate Portuguese, Spanish, Finnish, French, Hungarian, etc. into English.
Here is my code:
public static final String SRC = "5587309Finnish.pdf";
public static final String DEST = "changed.pdf";
public static void main(String[] args) throws java.io.IOException, DocumentException {
Translate translate = TranslateOptions.getDefaultInstance().getService();
PdfReader reader = new PdfReader(SRC);
int pages = reader.getNumberOfPages();
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
for(int i=1;i<=pages;i++) {
PdfDictionary dict = reader.getPageN(i);
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
if (object instanceof PRStream) {
String pageContent =
PdfTextExtractor.getTextFromPage(reader, i);
String[] word = pageContent.split(" ");
PRStream stream = (PRStream) object;
byte[] data = PdfReader.getStreamBytes(stream);
String dd = new String(data, BaseFont.CP1252);
for (int j=0; j < word.length; j++)
{
Translation translation = translate.translate(word[j],Translate.TranslateOption.sourceLanguage("fi"),
Translate.TranslateOption.targetLanguage("en"));
System.out.println(word[j]+"-->>"+translation.getTranslatedText());//here i can check the translation is correct.
dd = dd.replace(word[j],translation.getTranslatedText());
}
stream.setData(dd.getBytes());
}
}
stamper.close();
reader.close();
}
Please help.
According to a comment you have improved your code and are
getting the update dd(i.e. content stream which I am printing) correctly with the replaced text. I don't know why I am getting a blank pdf
Thus, I assume that your (hopefully representative) test PDFs have all their fonts of interest encoded in ANSI'ish encodings and the text arguments of the text drawing instructions contain whole words or even phrases which can properly be processed because otherwise text replacement would not have been possible.
Here is an example of how one can replace text pieces with similarly long ones under such benign circumstances without breaking the content stream syntax. In this example I simply use a Map containing replacement strings. You can do your translation there.
First a frame loading the source, creating a stamper, iterating over the pages, and calling a helper to create a content stream replacement:
Map<String, String> replacements = new HashMap<>();
replacements.put("Förfallodatum", "Ablaufdatum");
try ( InputStream resource = SOURCE_INPUTSTREAM;
OutputStream result = new FileOutputStream(RESULT_FILE) ) {
PdfReader pdfReader = new PdfReader(resource);
PdfStamper pdfStamper = new PdfStamper(pdfReader, result);
for (int pageNum = 1; pageNum <= pdfReader.getNumberOfPages(); pageNum++) {
PdfDictionary page = pdfReader.getPageN(pageNum);
byte[] pageContentInput = ContentByteUtils.getContentBytesForPage(pdfReader, pageNum);
page.remove(PdfName.CONTENTS);
replaceInStringArguments(pageContentInput, pdfStamper.getUnderContent(pageNum), replacements);
}
pdfStamper.close();
}
(EditPageContentSimple test testReplaceInStringArgumentsForklaringAvFakturan)
The method replaceInStringArguments now parses the instructions in the given content stream, isolates string arguments, and calls another helper for each string argument doing the replacement.
void replaceInStringArguments(byte[] contentBytesBefore, PdfContentByte canvas, Map<String, String> replacements) throws IOException {
PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(contentBytesBefore)));
PdfContentParser ps = new PdfContentParser(tokeniser);
ArrayList<PdfObject> operands = new ArrayList<PdfObject>();
while (ps.parse(operands).size() > 0){
for (int i = 0; i < operands.size(); i++) {
PdfObject pdfObject = operands.get(i);
if (pdfObject instanceof PdfString) {
operands.set(i, replaceInString((PdfString)pdfObject, replacements));
} else if (pdfObject instanceof PdfArray) {
PdfArray pdfArray = (PdfArray) pdfObject;
for (int j = 0; j < pdfArray.size(); j++) {
PdfObject arrayObject = pdfArray.getPdfObject(j);
if (arrayObject instanceof PdfString) {
pdfArray.set(j, replaceInString((PdfString)arrayObject, replacements));
}
}
}
}
for (PdfObject object : operands)
{
object.toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append((byte) ' ');
}
canvas.getInternalBuffer().append((byte) '\n');
}
}
(EditPageContentSimple helper method)
The method replaceInString in turn retrieves a single string operand (a PdfString instance), manipulates it, and returns the manipulated string version:
PdfString replaceInString(PdfString string, Map<String, String> replacements) {
String value = PdfEncodings.convertToString(string.getBytes(), PdfObject.TEXT_PDFDOCENCODING);
for (Map.Entry<String, String> entry : replacements.entrySet()) {
value = value.replace(entry.getKey(), entry.getValue());
}
return new PdfString(PdfEncodings.convertToBytes(value, PdfObject.TEXT_PDFDOCENCODING));
}
(EditPageContentSimple helper method)
Instead of that for loop here you would call your translation routine and translate value.
As has been mentioned before, this code only works under certain benign circumstances. Don't expect it to work for arbitrary documents from the wild, in particular not for documents with other than Western European glyphs.
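One cheap safeguard under these assumptions is to check, before replacing, whether the replacement text can be represented in a single-byte Western encoding at all. The sketch below uses ISO-8859-1 as a stand-in; it is close to but not identical with PDFDocEncoding or WinAnsiEncoding, so treat the result as a rough pre-check only. The class and method names are hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class EncodableCheck {
    // True if every character of the text has an ISO-8859-1 byte
    // (a stand-in for the single-byte encoding the code above assumes).
    static boolean fitsLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // "Foerfallodatum" with o-umlaut (U+00F6): representable
        System.out.println(fitsLatin1("F\u00f6rfallodatum")); // true
        // A Cyrillic word: not representable, a replacement would be garbled
        System.out.println(fitsLatin1("\u0413\u0430\u0437\u0435\u0442\u0430")); // false
    }
}
```

Even when this check passes, the font in the PDF may be subset and lack glyphs for some characters of the replacement.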

Java convert unicode code point to string

How can UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted in Java?
I have tried something like:
Character.toCodePoint((char) Integer.parseInt("D0", 16), (char) Integer.parseInt("93", 16));
but it does not convert to a valid code point.
That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).
Update
Here is a slightly more direct version of decoding the string, than provided by "sstan" in another answer. Of course both versions are good, so use whichever makes you more comfortable, or write your own version.
String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j+=3) {
assert src.charAt(j) == '=';
bytes[i] = (byte)(Character.digit(src.charAt(j + 1), 16) << 4 |
Character.digit(src.charAt(j + 2), 16));
}
String str = new String(bytes, StandardCharsets.UTF_8);
System.out.println(str);
Output
Газета
In UTF-8, a single character is not always encoded with the same amount of bytes. Depending on the character, it may require 1, 2, 3, or even 4 bytes to be encoded. Therefore, it's definitely not a trivial matter to try to map UTF-8 bytes yourself to a Java char which uses UTF-16 encoding, where each char is encoded using 2 bytes. Not to mention that, depending on the character (code point > 0xffff), you may also have to worry about dealing with surrogate characters, which is just one more complication that you can easily get wrong.
All this to say that Andreas is absolutely right. You should focus on parsing your string to a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.
Here is some sample code that shows one way this can be achieved:
public static void main(String[] args) throws Exception {
String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
// Parse string into hex string tokens.
String[] tokens = Arrays.stream(src.split("="))
.filter(s -> s.length() != 0)
.toArray(String[]::new);
// Convert the hex string representations to a byte array.
byte[] utf8bytes = new byte[tokens.length];
for (int i = 0; i < utf8bytes.length; i++) {
utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
}
// Convert UTF-8 bytes to Java String.
String str = new String(utf8bytes, StandardCharsets.UTF_8);
// Display string + individual unicode code points.
System.out.println(str);
str.codePoints().forEach(System.out::println);
}
Output:
Газета
1043
1072
1079
1077
1090
1072
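Both versions above assume that every byte is written as =XX. The input looks like MIME quoted-printable text, where plain ASCII characters usually appear unescaped between the escapes. A minimal sketch that tolerates such mixed input follows; it assumes unescaped characters are ASCII and does not handle quoted-printable soft line breaks ("=" at the end of a line). The class name is made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class HexEscapeDecoder {
    // Decodes =XX escapes to bytes, copies other (ASCII) characters as-is,
    // then interprets the whole byte sequence as UTF-8.
    static String decode(String src) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '=' && i + 2 < src.length()) {
                int hi = Character.digit(src.charAt(i + 1), 16);
                int lo = Character.digit(src.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 3;
                    continue;
                }
            }
            out.write(c); // literal character, assumed to be ASCII
            i++;
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0"));
        System.out.println(decode("Re: =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0"));
    }
}
```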

Reading text from swf with StuartMacKay's transform-swf library

I need to extract all the text from some SWF files. I'm using Java since I already have a lot of modules developed in this language.
So I searched the Web for free Java libraries that handle SWF files.
Finally, I found the library developed by StuartMacKay. The library, named transform-swf, is available on GitHub.
The question is: once I extract the GlyphIndexes from a TextSpan, how can I convert the glyphs into characters?
Please, provide a complete working and tested example. No theoretical answer will be accepted nor answers like "it cannot be done", "it ain't possible", etc.
What I know and what I did
I know that the GlyphIndexes are built using a TextTable, which is constructed from an integer that represents the font size and a font description provided by a DefineFont2 object. But when I decode the DefineFont2 objects, all of them have an empty list of advances.
Here follows what I did.
//Creating a Movie object from an swf file.
Movie movie = new Movie();
movie.decodeFromFile(new File(out));
//The decoded tags; this is the list iterated below.
List<MovieTag> list = movie.getObjects();
//Saving all the decoded DefineFont2 objects.
Map<Integer,DefineFont2> fonts = new HashMap<>();
for (MovieTag object : list) {
if (object instanceof DefineFont2) {
DefineFont2 df2 = (DefineFont2) object;
fonts.put(df2.getIdentifier(), df2);
}
}
//Now I retrieve all the texts
for (MovieTag object : list) {
if (object instanceof DefineText2) {
DefineText2 dt2 = (DefineText2) object;
for (TextSpan ts : dt2.getSpans()) {
Integer fontIdentifier = ts.getIdentifier();
if (fontIdentifier != null) {
int fontSize = ts.getHeight();
// Here I try to create an object that should
// reverse the process done by a TextTable
ReverseTextTable rtt =
new ReverseTextTable(fonts.get(fontIdentifier), fontSize);
System.out.println(rtt.charactersForText(ts.getCharacters()));
}
}
}
}
The class ReverseTextTable follows here:
public final class ReverseTextTable {
private final transient Map<Character, GlyphIndex> characters;
private final transient Map<GlyphIndex, Character> glyphs;
public ReverseTextTable(final DefineFont2 font, final int fontSize) {
characters = new LinkedHashMap<>();
glyphs = new LinkedHashMap<>();
final List<Integer> codes = font.getCodes();
final List<Integer> advances = font.getAdvances();
final float scale = fontSize / EMSQUARE;
final int count = codes.size();
for (int i = 0; i < count; i++) {
characters.put((char) codes.get(i).intValue(), new GlyphIndex(i,
(int) (advances.get(i) * scale)));
glyphs.put(new GlyphIndex(i,
(int) (advances.get(i) * scale)), (char) codes.get(i).intValue());
}
}
//This method should reverse from a list of GlyphIndexes to a String
public String charactersForText(final List<GlyphIndex> list) {
String text="";
for(GlyphIndex gi: list){
text+=glyphs.get(gi);
}
return text;
}
}
Unfortunately, the list of advances from DefineFont2 is empty, so the constructor of ReverseTextTable throws an ArrayIndexOutOfBoundsException.
Honestly, I don't know how to do that in Java. I'm not claiming that it is not possible; I also believe that there is a way to do it. However, you said that there are a lot of libraries that do that, and you also suggested one, i.e. swftools. So I suggest using that tool to extract the text from a Flash file. To do that you can use Runtime.exec() to run it from the command line.
Personally, I prefer Apache Commons Exec over the standard library shipped with the JDK. Let me show you how to do it. The executable to use is "swfstrings.exe". Suppose that it is located in "C:\" and that the same folder contains a Flash file, e.g. page.swf. Then I tried the following code (it works fine):
Path pathToSwfFile = Paths.get("C:\\", "page.swf");
CommandLine commandLine = CommandLine.parse("C:\\swfstrings.exe");
commandLine.addArgument("\"" + pathToSwfFile.toString() + "\"");
DefaultExecutor executor = new DefaultExecutor();
executor.setExitValues(new int[]{0, 1}); //Notice that swfstrings.exe returns 1 for success,
//0 for file not found, -1 for error
ByteArrayOutputStream stdout = new ByteArrayOutputStream();
PumpStreamHandler psh = new PumpStreamHandler(stdout);
executor.setStreamHandler(psh);
int exitValue = -1; // stays -1 (treated as a failure below) if execute() throws
try{
exitValue = executor.execute(commandLine);
}catch(org.apache.commons.exec.ExecuteException ex){
psh.stop();
}
if(!executor.isFailure(exitValue)){
String out = stdout.toString("UTF-8"); // here you have the extracted text
}
I know, this is not exactly the answer that you requested, but works fine.
I happened to be working on decompiling an SWF in Java and came across this question while figuring out how to reverse engineer the original text.
After looking at the source code, I realised it's really straightforward. Each font has an assigned sequence of characters that can be retrieved by calling DefineFont2.getCodes(), and the glyphIndex is the index of the matching character in DefineFont2.getCodes().
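The index lookup itself can be shown without the library. In this sketch the codes list stands in for the result of DefineFont2.getCodes() and the int array for the glyph indexes taken from the GlyphIndex objects of a TextSpan; both are made-up sample data, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class GlyphLookupDemo {
    // Maps each glyph index to the character code stored at that position in the font.
    static String decode(List<Integer> codes, int[] glyphIndexes) {
        StringBuilder sb = new StringBuilder();
        for (int glyphIndex : glyphIndexes) {
            if (glyphIndex >= 0 && glyphIndex < codes.size()) {
                sb.appendCodePoint(codes.get(glyphIndex));
            } else {
                sb.append('\uFFFD'); // index not covered by this font: probably the wrong font
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = Arrays.asList(72, 101, 108, 111); // 'H', 'e', 'l', 'o'
        int[] glyphIndexes = {0, 1, 2, 2, 3};
        System.out.println(decode(codes, glyphIndexes)); // Hello
    }
}
```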
However, when multiple fonts are used in a single SWF file, it is difficult to match each DefineText to the corresponding DefineFont2, because I found no attribute that identifies the DefineFont2 used for each DefineText.
To work around this issue, I came up with a self-learning algorithm that attempts to guess the right DefineFont2 for each DefineText and thereby derive the original text correctly.
To reverse engineer the original text back, I created a class called FontLearner:
public class FontLearner {
private final ArrayList<DefineFont2> fonts = new ArrayList<DefineFont2>();
private final HashMap<Integer, HashMap<Character, Integer>> advancesMap = new HashMap<Integer, HashMap<Character, Integer>>();
/**
* The same characters from the same font will have similar advance values.
* This constant defines the allowed difference between two advance values
* before they are treated as the same character
*/
private static final int ADVANCE_THRESHOLD = 10;
/**
* Some characters have outlier advance values despite being compared
* to the same character
* This constant defines the minimum accuracy level for each String
* before it is associated with the given font
*/
private static final double ACCURACY_THRESHOLD = 0.9;
/**
* This method adds a DefineFont2 to the learner, and a DefineText
* associated with the font to teach the learner about the given font.
*
* @param font The font to add to the learner
* @param text The text associated with the font
*/
public void addFont(DefineFont2 font, DefineText text) {
fonts.add(font);
HashMap<Character, Integer> advances = new HashMap<Character, Integer>();
advancesMap.put(font.getIdentifier(), advances);
List<Integer> codes = font.getCodes();
List<TextSpan> spans = text.getSpans();
for (TextSpan span : spans) {
List<GlyphIndex> characters = span.getCharacters();
for (GlyphIndex character : characters) {
int glyphIndex = character.getGlyphIndex();
char c = (char) (int) codes.get(glyphIndex);
int advance = character.getAdvance();
advances.put(c, advance);
}
}
}
/**
*
* @param text The DefineText to retrieve the original String from
* @return The String retrieved from the given DefineText
*/
public String getString(DefineText text) {
StringBuilder sb = new StringBuilder();
List<TextSpan> spans = text.getSpans();
DefineFont2 font = null;
for (DefineFont2 getFont : fonts) {
List<Integer> codes = getFont.getCodes();
HashMap<Character, Integer> advances = advancesMap.get(getFont.getIdentifier());
if (advances == null) {
advances = new HashMap<Character, Integer>();
advancesMap.put(getFont.getIdentifier(), advances);
}
boolean notFound = true;
int totalMisses = 0;
int totalCount = 0;
for (TextSpan span : spans) {
List<GlyphIndex> characters = span.getCharacters();
totalCount += characters.size();
int misses = 0;
for (GlyphIndex character : characters) {
int glyphIndex = character.getGlyphIndex();
if (codes.size() > glyphIndex) {
char c = (char) (int) codes.get(glyphIndex);
Integer getAdvance = advances.get(c);
if (getAdvance != null) {
notFound = false;
if (Math.abs(character.getAdvance() - getAdvance) > ADVANCE_THRESHOLD) {
misses += 1;
}
}
} else {
notFound = false;
misses = characters.size();
break;
}
}
totalMisses += misses;
}
double accuracy = (totalCount - totalMisses) * 1.0 / totalCount;
if (accuracy > ACCURACY_THRESHOLD && !notFound) {
font = getFont;
// teach this DefineText to the FontLearner if there are
// any new characters
for (TextSpan span : spans) {
List<GlyphIndex> characters = span.getCharacters();
for (GlyphIndex character : characters) {
int glyphIndex = character.getGlyphIndex();
char c = (char) (int) codes.get(glyphIndex);
int advance = character.getAdvance();
if (advances.get(c) == null) {
advances.put(c, advance);
}
}
}
break;
}
}
if (font != null) {
List<Integer> codes = font.getCodes();
for (TextSpan span : spans) {
List<GlyphIndex> characters = span.getCharacters();
for (GlyphIndex character : characters) {
int glyphIndex = character.getGlyphIndex();
char c = (char) (int) codes.get(glyphIndex);
sb.append(c);
}
sb = new StringBuilder(sb.toString().trim());
sb.append(" ");
}
}
return sb.toString().trim();
}
}
Usage:
Movie movie = new Movie();
movie.decodeFromStream(response.getEntity().getContent());
FontLearner learner = new FontLearner();
DefineFont2 font = null;
List<MovieTag> objects = movie.getObjects();
for (MovieTag object : objects) {
if (object instanceof DefineFont2) {
font = (DefineFont2) object;
} else if (object instanceof DefineText) {
DefineText text = (DefineText) object;
if (font != null) {
learner.addFont(font, text);
font = null;
}
String line = learner.getString(text); // reverse engineers the line
}
}
I am happy to say that this method has given me 100% accuracy in reverse engineering the original String using StuartMacKay's transform-swf library.
What you are trying to achieve seems difficult. You are trying to decompile the file, but I am sorry to say that, as far as I know, this is not possible. What I would suggest is to convert it into a bitmap (if possible) or use some other method, and then read the characters using OCR.
There is some software that does that, and you can also check some forums about it. Recovering content from a compiled SWF is very difficult (and not possible, as far as I know). You can check this decompiler if you want, or try using some other languages like the project here.
I had a similar problem with long strings using transform-swf library.
Got the source code and debugged it.
I believe there was a small bug in class com.flagstone.transform.coder.SWFDecoder.
Line 540 (applicable to version 3.0.2), change
dest += length;
with
dest += count;
That should do it for you (it's about extracting strings).
I notified Stuart as well. The problem appears only if your strings are very large.
I know this isn't what you asked, but I needed to pull text from SWF files recently using Java and found the ffdec library much better than transform-swf.
Comment if anyone needs sample code.

In Java, retrieve a JPEG from a URL and convert it to binary or hexadecimal form suitable for embedding in an RTF document

I'm trying to write a simple RTF document pretty much from scratch in Java, and I'm trying to embed JPEGs in the document. Here's an example of a JPEG (a 2x2-pixel JPEG consisting of three white pixels and a black pixel in the upper left, if you're curious) embedded in an RTF document (generated by WordPad, which converted the JPEG to WMF):
{\pict\wmetafile8\picw53\pich53\picwgoal30\pichgoal30
0100090000036e00000000004500000000000400000003010800050000000b0200000000050000
000c0202000200030000001e000400000007010400040000000701040045000000410b2000cc00
020002000000000002000200000000002800000002000000020000000100040000000000000000
000000000000000000000000000000000000000000ffffff00fefefe0000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000
0000001202af0801010000040000002701ffff030000000000
}
I've been reading the RTF specification, and it looks like you can specify that the image is a JPEG, but since WordPad always converts images to WMF, I can't see an example of an embedded JPEG. So I may also end up needing to transcode from JPEG to WMF or something....
But basically, I'm looking for how to generate the binary or hexadecimal (Spec, p.148: "These pictures can be in hexadecimal (the default) or binary format.") form of a JPEG given a file URL.
Thanks!
EDIT: I have the stream stuff working all right, I think, but still don't understand exactly how to encode it, because whatever I'm doing, it's not RTF-readable. E.g., the above picture instead comes out as:
ffd8ffe00104a464946011106006000ffdb0430211211222222223533333644357677767789b988a877adaabcccc79efdcebcccffdb04312223336336c878ccccccccccccccccccccccccccccccccccccccccccccccccccffc0011802023122021113111ffc401f001511111100000000123456789abffc40b5100213324355440017d123041151221314161351617227114328191a182342b1c11552d1f024336272829a161718191a25262728292a3435363738393a434445464748494a535455565758595a636465666768696a737475767778797a838485868788898a92939495969798999aa2a3a4a5a6a7a8a9aab2b3b4b5b6b7b8b9bac2c3c4c5c6c7c8c9cad2d3d4d5d6d7d8d9dae1e2e3e4e5e6e7e8e9eaf1f2f3f4f5f6f7f8f9faffc401f103111111111000000123456789abffc40b51102124434754401277012311452131612415176171132232818144291a1b1c19233352f0156272d1a162434e125f11718191a262728292a35363738393a434445464748494a535455565758595a636465666768696a737475767778797a82838485868788898a92939495969798999aa2a3a4a5a6a7a8a9aab2b3b4b5b6b7b8b9bac2c3c4c5c6c7c8c9cad2d3d4d5d6d7d8d9dae2e3e4e5e6e7e8e9eaf2f3f4f5f6f7f8f9faffda0c31021131103f0fdecf09f84f4af178574cd0b42d334fd1744d16d22bd3f4fb0b74b6b5bb78902450c512091c688aaaa8a0500014514507ffd9
This PHP library would do the trick, so I'm trying to port the relevant portion to Java. Here it is:
$imageData = file_get_contents($this->_file);
$size = filesize($this->_file);
$hexString = '';
for ($i = 0; $i < $size; $i++) {
$hex = dechex(ord($imageData{$i}));
if (strlen($hex) == 1) {
$hex = '0' . $hex;
}
$hexString .= $hex;
}
return $hexString;
But I don't know what the Java analogue to dechex(ord($imageData{$i})) is. :( I got only as far as the Integer.toHexString() function, which takes care of the dechex part....
Thanks all. :)
Given a file URL for any file you can get the corresponding bytes by doing (exception handling omitted for brevity)...
int BUF_SIZE = 512;
URL fileURL = new URL("http://www.somewhere.com/someurl.jpg");
InputStream inputStream = fileURL.openStream();
byte [] smallBuffer = new byte[BUF_SIZE];
ByteArrayOutputStream largeBuffer = new ByteArrayOutputStream();
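int numRead;
// read() may return fewer than BUF_SIZE bytes before the end of the stream, so loop until it returns -1
while((numRead = inputStream.read(smallBuffer,0,BUF_SIZE)) != -1) {
// write only the bytes that were actually read
largeBuffer.write(smallBuffer,0,numRead);
}
inputStream.close();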
byte [] bytes = largeBuffer.toByteArray();
I'm looking at your PHP snippet now and realizing that RTF is a bizarre specification! It looks like each byte of the image is encoded as 2 hex digits (which doubles the size of the image for no apparent reason). Then the entire thing is stored in raw ASCII encoding. So, you'll want to do...
StringBuilder hexStringBuilder = new StringBuilder(bytes.length * 2);
for(byte imageByte : bytes) {
String hexByteString = Integer.toHexString(0x000000FF & (int)imageByte);
if(hexByteString.length() == 1) {
hexByteString = "0" + hexByteString;
}
hexStringBuilder.append(hexByteString);
}
String hexString = hexStringBuilder.toString();
byte [] hexBytes = hexString.getBytes("UTF-8"); //Could also use US-ASCII
EDIT: Updated code sample to pad 0's on the hex bytes
EDIT: negative bytes were getting logically right shifted when converted to ints >_<
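The hex output shown in the question appears to have lost its leading zeros (a JFIF file normally starts with ff d8 ff e0 00 10), which is what the padding above fixes. As a compact variant, the sketch below emits exactly two hex digits per byte and wraps them in a picture group. The \jpegblip, \picwgoal, and \pichgoal control words are taken from the RTF specification; whether a given RTF reader renders the result is not something this sketch verifies, and the class and method names are made up.

```java
final class RtfJpegDemo {
    // Two lowercase hex digits per byte, so leading zeros are never dropped.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Wraps JPEG bytes in an RTF picture group; the goal sizes are given in twips.
    static String pictGroup(byte[] jpeg, int widthTwips, int heightTwips) {
        return "{\\pict\\jpegblip\\picwgoal" + widthTwips
                + "\\pichgoal" + heightTwips + "\n"
                + toHex(jpeg) + "}";
    }

    public static void main(String[] args) {
        byte[] start = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0x00, 0x10};
        System.out.println(toHex(start)); // ffd8ffe00010
        System.out.println(pictGroup(start, 30, 30));
    }
}
```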
https://joseluisbz.wordpress.com/2013/07/26/exploring-a-wmf-file-0x000900/
Maybe this will help you:
String HexRTFBytes = "Representations text of bytes from Image RTF File";
String Destiny = "The path of the output File";
FileOutputStream wmf;
try {
wmf = new FileOutputStream(Destiny);
HexRTFBytes = HexRTFBytes.replaceAll("\n", ""); //Erase New Lines
HexRTFBytes = HexRTFBytes.replaceAll(" ", ""); //Erase Blank spaces
int NumBytesWrite = HexRTFBytes.length();
int WMFBytes = NumBytesWrite/2;//One byte is represented by 2 characters
byte[] ByteWrite = new byte[WMFBytes];
for (int i = 0; i < WMFBytes; i++){
String se = HexRTFBytes.substring(i*2,i*2+2);
int Entero = Integer.parseInt(se,16);
ByteWrite[i] = (byte)Entero;
}
wmf.write(ByteWrite);
wmf.close();
}
catch (FileNotFoundException fnfe)
{System.out.println(fnfe.toString());}
catch (NumberFormatException fnfe)
{System.out.println(fnfe.toString());}
catch (EOFException eofe)
{System.out.println(eofe.toString());}
catch (IOException ioe)
{System.out.println(ioe.toString());}
This code takes the hexadecimal representation as one string and stores the decoded bytes in a file.
https://joseluisbz.wordpress.com/2011/06/22/script-de-clases-rtf-para-jsp-y-php/
Now, if you want to obtain the hexadecimal representation of an image file, you can use this:
private void ByteStreamImageString(byte[] ByteStream) {
this.Format = 0;
this.High = 0;
this.Wide = 0;
this.HexImageString = "Error";
if (ByteStream[0]== (byte)137 && ByteStream[1]== (byte)80 && ByteStream[2]== (byte)78){
this.Format = PNG; //PNG
this.High = this.Byte2PosInt(ByteStream[22],ByteStream[23]);
this.Wide = this.Byte2PosInt(ByteStream[18],ByteStream[19]);
}
if (ByteStream[0]== (byte)255 && ByteStream[1]== (byte)216
&& ByteStream[2]== (byte)255 && ByteStream[3]== (byte)224){
this.Format = JPG; //JPG
int PosJPG = 2;
while (PosJPG < ByteStream.length){
String M = String.format("%02X%02X", ByteStream[PosJPG+0],ByteStream[PosJPG+1]);
if (M.equals("FFC0") || M.equals("FFC1") || M.equals("FFC2") || M.equals("FFC3")){
this.High = this.Byte2PosInt(ByteStream[PosJPG+5],ByteStream[PosJPG+6]);
this.Wide = this.Byte2PosInt(ByteStream[PosJPG+7],ByteStream[PosJPG+8]);
}
if (M.equals("FFDA")) {
break;
}
PosJPG = PosJPG+2+this.Byte2PosInt(ByteStream[PosJPG+2],ByteStream[PosJPG+3]);
}
}
if (this.Format > 0) {
this.HexImageString = "";
int Salto = 0;
for (int i=0;i < ByteStream.length; i++){
Salto++;
this.HexImageString += String.format("%02x", ByteStream[i]);
if (Salto==64){
this.HexImageString += "\n"; //To make readable
Salto = 0;
}
}
}
}
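The method above relies on a Byte2PosInt helper that is not shown. Judging from how it is called (high byte first, for big-endian PNG and JPEG size fields), a plausible version would look like the following; this is a guess at the missing helper, not the author's original code.

```java
final class ByteUtil {
    // Combines two bytes (high byte first) into a non-negative int in the range 0..65535.
    static int byte2PosInt(byte high, byte low) {
        return ((high & 0xFF) << 8) | (low & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(byte2PosInt((byte) 0x01, (byte) 0x00)); // 256
        System.out.println(byte2PosInt((byte) 0xFF, (byte) 0xD8)); // 65496
    }
}
```

The masks with 0xFF matter because Java bytes are signed; without them, values of 128 and above would turn negative.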
