As per this 3rd answer, I can write a file like this:
Files.write(Paths.get("file6.txt"), lines, utf8,
StandardOpenOption.CREATE, StandardOpenOption.APPEND);
However, when I try it in my code I get this error:
The method write(Path, Iterable, Charset,
OpenOption...) in the type Files is not applicable for the arguments
(Path, byte[], Charset, StandardOpenOption)
This is my code:
File dir = new File(myDirectoryPath);
File[] directoryListing = dir.listFiles();
if (directoryListing != null) {
    File newScript = new File(newPath + "//newScript.pbd");
    if (!newScript.exists()) {
        newScript.createNewFile();
    }
    for (File child : directoryListing) {
        if (!child.isDirectory()) {
            byte[] content = null;
            Charset utf8 = StandardCharsets.UTF_8;
            content = readFileContent(child);
            try {
                Files.write(Paths.get(newPath + "\\newScript.pbd"), content, utf8,
                        StandardOpenOption.APPEND); // <== error here on this line
            } catch (Exception e) {
                System.out.println("COULD NOT LOG!! " + e);
            }
        }
    }
}
Note: if I change my code to the following, it works and writes into the file (utf8 removed).
Files.write(Paths.get(newPath + "\\newScript.pbd"), content,
StandardOpenOption.APPEND);
Explanation
There are 3 overloads of the Files#write method (see the documentation):
1. Takes Path, byte[], OpenOption... (no charset)
2. Takes Path, Iterable<? extends CharSequence>, OpenOption... (like List<String>, no charset, uses UTF-8)
3. Takes Path, Iterable<? extends CharSequence>, Charset, OpenOption... (has charset)
For your call (Path, byte[], Charset, OpenOption...), no matching version exists. Thus, it does not compile.
It does not match the first or second overload, since they do not accept a Charset; and it does not match the third, since an array is not Iterable (a class like ArrayList is), nor does byte extend CharSequence (String does).
In the error message you can see what Java computed as the closest match to your call; unfortunately, as explained, it is not applicable to your arguments.
Solution
You most likely intended to go for the first overload:
Files.write(Paths.get("file6.txt"), lines,
StandardOpenOption.CREATE, StandardOpenOption.APPEND);
I.e. no charset.
Notes
A charset makes no sense in your context. The sole purpose of a charset is to correctly convert a String into binary (byte[]). But your data is already binary, so the charset conversion takes place before that point; in your case, that would be the reading phase in readFileContent.
Also note that all methods in Files already use UTF-8 by default, so there is no need to specify it explicitly anyway.
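For example, these two calls are equivalent (path and lines are placeholders):
Files.write(path, lines);
Files.write(path, lines, StandardCharsets.UTF_8);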
When specifying OpenOptions, you may also want to state explicitly whether you want StandardOpenOption.READ or StandardOpenOption.WRITE mode. By default, the Files#write method uses WRITE, CREATE and TRUNCATE_EXISTING. So you may want to call it with WRITE, CREATE and APPEND instead, as shown below.
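A minimal sketch of such a call (path and content are placeholders):
Path path = Paths.get("newScript.pbd");
byte[] content = { 1, 2, 3 }; // placeholder data
Files.write(path, content,
        StandardOpenOption.WRITE, StandardOpenOption.CREATE, StandardOpenOption.APPEND);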
Examples
Here are some snippets of how to fully read and write in text and binary, in UTF-8:
// Text mode
List<String> lines = Files.readAllLines(inputPath);
// do something with lines
Files.write(outputPath, lines);
// Binary mode
byte[] content = Files.readAllBytes(inputPath);
// do something with content
Files.write(outputPath, content);
In the 3rd answer you mentioned, the content of the file is iterable, namely a List. You cannot use that method with a byte[]. Make your readFileContent() method return something iterable like a List<String> (each element being a line of your file), as sketched below.
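A minimal sketch of such a readFileContent, assuming the input files are UTF-8 text (the helper's exact shape is an assumption, not the asker's actual code):
private static List<String> readFileContent(File child) throws IOException {
    // Files.readAllLines decodes the file as UTF-8 and returns one element per line
    return Files.readAllLines(child.toPath());
}
With that, the write call matches the third (Iterable + Charset) overload:
Files.write(Paths.get(newPath + "\\newScript.pbd"), readFileContent(child),
        StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND);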
Related
I do some AcroField manipulation for text fields which have parent fields. This works so far, but the form also contains some checkboxes; they should not be changed. But when I store the manipulated PDF to disk and inspect the value of the checkbox, I can see that the value of cb_a.0 has been changed from ÄÖÜ?ß to ?????.
My further processing fails because of this unintended change; any idea how to prevent that?
My test case:
@Test
public void changeBoxedFieldsToOne() throws IOException {
    File encodingPdfFile = new File(classLoader.getResource("./prefill/TestFormEncoding.pdf").getFile());
    byte[] encodingPdfByte = Files.readAllBytes(encodingPdfFile.toPath());
    PdfAcrofieldManipulator pdfMani = new PdfAcrofieldManipulator(encodingPdfByte);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() > 0);
    pdfMani.changeBoxedFieldsToOne();
    byte[] changedPdf = pdfMani.savePdf();
    Files.write(Paths.get("./build/changeBoxedFieldsToOne.pdf"), changedPdf);
    pdfMani = new PdfAcrofieldManipulator(changedPdf);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() == 0);
}
public void changeBoxedFieldsToOne() {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    List<PDNonTerminalField> textFieldWithMoreThan2Childrens = getTextFieldsWithMoreThan2Children();
    for (PDField field : textFieldWithMoreThan2Childrens) {
        int amountOfChilds = ((PDNonTerminalField) field).getChildren().size();
        String currentFieldName = field.getPartialName();
        LOG.info("merging fields of fieldname {0} to one field", currentFieldName);
        PDField firstChild = getChildWithPartialName((PDNonTerminalField) field, "0");
        if (firstChild == null) {
            LOG.debug("found field which has a dot but does not start with 0, skipping this field");
            continue;
        }
        PDField lastChild = getChildWithPartialName((PDNonTerminalField) field, Integer.toString(amountOfChilds - 1));
        PDPage pageWhichContainsField = firstChild.getWidgets().get(0).getPage();
        try {
            removeField(pdDocument, currentFieldName);
        } catch (IOException e) {
            LOG.error("Error while removing field {0}", currentFieldName, e);
        }
        PDField newField = creatNewField(acroForm, field, firstChild, lastChild, pageWhichContainsField);
        acroForm.getFields().add(newField);
        PDAnnotationWidget newFieldWidget = createWidgetForField(newField, pageWhichContainsField, firstChild, lastChild);
        try {
            pageWhichContainsField.getAnnotations().add(newFieldWidget);
        } catch (IOException e) {
            LOG.error("error while adding new field to page");
        }
    }
}
public byte[] savePdf() throws IOException {
    try (final ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        //pdDocument.saveIncremental(out);
        pdDocument.save(out);
        pdDocument.close();
        return out.toByteArray();
    }
}
I am using PDFBox 2.0.8
Here is the source PDF: https://ufile.io/gr01f or here https://www.file-upload.net/download-12928052/TestFormEncoding.pdf.html
Here is the output: https://ufile.io/k8cr3 or here https://www.file-upload.net/download-12928049/changeBoxedFieldsToOne.pdf.html
This indeed is a bug in PDFBox: PDFBox cannot properly handle PDF Name objects containing bytes with values outside the US_ASCII range (in particular outside the range 0..127, and your umlauts are outside).
The first error in PDF Name handling is that PDFBox internally represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. This is wrong. According to the PDF specification, a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). [...]
Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a PDF processor. However, occasionally the need arises to treat a name object as text, such as one that represents a font name [...], a colourant name in a Separation or DeviceN colour space, or a structure type [...]
In such situations, the sequence of bytes making up the name object should be interpreted according to UTF-8, a variable-length byte-encoded representation.
Thus, it generally does not make sense to treat a name as anything else than a byte sequence. Only names used in certain contexts should be meaningful as UTF-8 encoded strings.
Furthermore, a mixed UTF-8 / CP-1252 decoding strategy, i.e. one that first tries to decode using UTF-8 and in case of failure tries again with CP-1252, can create the same string representation for different name entities; this can falsify a document by making unequal names equal.
This is not the problem in your case, though; the names you used can be interpreted.
The second error, though, is that while serializing the PDF, PDFBox only properly encodes those characters in the strings representing names which are US_ASCII; everything else is replaced by '?':
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        [...]
    }
}
(from org.apache.pdfbox.cos.COSName.writePDF(OutputStream))
This is where your checkbox values (which internally are represented by PDF Name objects) get damaged beyond repair...
A simpler example showing the problem is this:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
document.save(new File(RESULT_FOLDER, "non-ascii-name.pdf"));
document.close();
In the result the catalog with the custom entry looks like this:
1 0 obj
<<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
/#3F#3F#3F#3F <E4F6FCDF>
>>
In the name key, all characters have been replaced by '?' in hex-encoded form (#3F), while in the string value the characters are appropriately encoded.
After a bit of searching I stumbled over an answer on this topic I gave almost two years ago. Back then the PDF Name object bytes were always interpreted as UTF-8 encoded which led to issues in that question.
As a consequence the issue PDFBOX-3347 was created. To resolve it the mixed UTF-8 / CP-1252 decoding strategy was introduced. As expressed above, though, I'm not a friend of that strategy.
In that stack overflow answer I also already discussed the problems related to the use of US_ASCII during PDF serialization but that aspect has not yet been addressed at all.
Another related issue is PDFBOX-3519 but its resolution also was reduced to trying to fix the parsing of PDF Names, ignoring the serialization of it.
Yet another related issue is PDFBOX-2836.
I'm looking to add a custom metadata tag to any type of file using functionality from java.nio.file.Files. I have been able to read metadata correctly, but am having issues whenever I try to set metadata.
I've tried to set a custom metadata element with a plain string using Files.setAttribute with the following:
Path photo = Paths.get("C:\\Users\\some\\picture\\path\\2634.jpeg");
try {
    BasicFileAttributes attrs = Files.readAttributes(photo, BasicFileAttributes.class);
    Files.setAttribute(photo, "user:tags", "test");
    String attribute = Files.getAttribute(photo, "user:tags").toString();
    System.out.println(attribute);
} catch (IOException ioex) {
    ioex.printStackTrace();
}
but end up with the following error:
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.nio.ByteBuffer
If I instead wrap that string in a ByteBuffer like so:
Path photo = Paths.get("C:\\Users\\some\\picture\\path\\2634.jpeg");
try {
    BasicFileAttributes attrs = Files.readAttributes(photo, BasicFileAttributes.class);
    Files.setAttribute(photo, "user:tags", ByteBuffer.wrap(("test").getBytes("UTF-8")));
    String attribute = Files.getAttribute(photo, "user:tags").toString();
    System.out.println(attribute);
} catch (IOException ioex) {
    ioex.printStackTrace();
}
instead of outputting the text 'test', it outputs the strange character string '[B@14e3f41'
What is the proper way to convert a String to a bytebuffer and have it be convertable back into a string, and is there a more customizable way to modify metadata on a File using java?
User defined attributes, that is, any attribute defined by UserDefinedFileAttributeView (provided that your FileSystem supports them!), are readable/writable from Java as byte arrays; if a given attribute contains text content, it is then process-dependent what the encoding of the string in question will be.
Now, you are using the .{get,set}Attribute() methods, which means that you have two options to write user attributes:
either using a ByteBuffer like you did; or
using a plain byte array.
What you will read out of it, however, is always a byte array.
From the javadoc link above (emphasis mine):
Where dynamic access to file attributes is required, the getAttribute method may be used to read the attribute value. The attribute value is returned as a byte array (byte[]). The setAttribute method may be used to write the value of a user-defined attribute from a buffer (as if by invoking the write method), or byte array (byte[]).
So, in your case:
in order to write the attribute, obtain a byte array with the requested encoding from your string:
final Charset utf8 = StandardCharsets.UTF_8;
final String myAttrValue = "Mémé dans les orties";
final byte[] userAttributeValue = myAttrValue.getBytes(utf8);
Files.setAttribute(photo, "user:tags", userAttributeValue);
in order to read the attribute, you'll need to cast the result of .getAttribute() to a byte array, and then obtain a string out of it, again using the correct encoding:
final Charset utf8 = StandardCharsets.UTF_8;
final byte[] userAttributeValue
    = (byte[]) Files.getAttribute(photo, "user:tags");
final String myAttrValue = new String(userAttributeValue, utf8);
A peek into the other solution, just in case...
As already mentioned, what you want to deal with is a UserDefinedFileAttributeView. The Files class allows you to obtain any FileAttributeView implementation using this method:
final UserDefinedFileAttributeView view
= Files.getFileAttributeView(photo, UserDefinedFileAttributeView.class);
Now, once you have this view at your disposal, you may read from, or write to, it.
For instance, here is how you would read your particular attribute; note that here we only use the attribute name, since the view (with name "user") is already there:
final Charset utf8 = StandardCharsets.UTF_8;
final int attrSize = view.size("tags");
final ByteBuffer buf = ByteBuffer.allocate(attrSize);
view.read("tags", buf);
return new String(buf.array(), utf8);
In order to write, you'll need to wrap the byte array into a ByteBuffer:
final Charset utf8 = StandardCharsets.UTF_8;
final byte[] array = tagValue.getBytes(utf8);
final ByteBuffer buf = ByteBuffer.wrap(array);
view.write("tags", buf);
Like I said, it gives you more control, but is more involved.
Final note: as the name pretty much dictates, user defined attributes are user defined; a given attribute for this view may, or may not, exist. It is your responsibility to correctly handle errors if an attribute does not exist etc; the JDK offers no such thing as NoSuchAttributeException for this kind of scenario.
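A minimal defensive sketch of such handling (the attribute name "tags" is just this example's choice):
final UserDefinedFileAttributeView view
    = Files.getFileAttributeView(photo, UserDefinedFileAttributeView.class);
// view is null if the file system does not support user-defined attributes
if (view != null && view.list().contains("tags")) {
    final ByteBuffer buf = ByteBuffer.allocate(view.size("tags"));
    view.read("tags", buf);
    final String value = new String(buf.array(), StandardCharsets.UTF_8);
    System.out.println(value);
}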
I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂
I would like to convert those characters to utf-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?
Apache commons-text library has the StringEscapeUtils class that has the unescapeHtml4() utility method.
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml().
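For instance, a quick sketch with an illustrative numeric character reference (&#128514; is the decimal reference for the code point U+1F602, 😂):
String htmlStr = "10&#128514;";
String utf16Str = StringEscapeUtils.unescapeHtml4(htmlStr); // yields "10😂"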
@Bohemian's code is correct; it works for me. Your un-encoded string is 10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂.
Now, I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
1. I copy-pasted your string into HTML code and the browser can't render your characters properly, because your string is incorrectly encoded: it encodes the high surrogate and the low surrogate of each two-unit character separately, instead of encoding the whole code point (it seems the original string is a UTF-16 encoded string, maybe a Java String?).
2. You want the string to be re-encoded to UTF-8.
Once you have your string unescaped by StringEscapeUtils.unescapeHtml(htmlStr) (which un-encodes your string successfully despite it being encoded incorrectly), it doesn't make much sense to talk about "string encodings", as Java strings are unaware of encodings (they use UTF-16 internally, though).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from a String encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUtf8String = javaStr.getBytes("UTF-8");
And do with such byte array whatever you need.
Now if what you need is to write a UTF-8 encoded string to a File, instead of that byte array you need to specify the encoding when you create the proper java.io.Writer.
Try this code to un-encode your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
    String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (Writer output = new OutputStreamWriter(
            new FileOutputStream("/path/to/testing.txt"), "UTF-8")) {
        output.write(javaString);
    }
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
    String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
        for (byte b : javaString.getBytes(Charset.forName("UTF-8"))) {
            output.write(b);
        }
    }
}
Is there a way to get the encoding of an existing .txt file? For example: you know a customer needs a specific encoding, and you want to automate the process of .sql data delivery. You would read the encoding from a client config and compare it to the current encoding of the file to be delivered; if they differ, you change the encoding. I could not find a solution so far; any help would be appreciated.
There is no explicit declaration of text encoding in files, but you can guess the encoding by analyzing specific byte sequences that are characteristic of a certain encoding.
Chardet does exactly that and tries to guess. If it can't say for sure what the encoding is, it will give you a list with confidence values (e.g. "90% this is utf8"). The project includes both a Python module and a command line tool. For a Java version, see JChardet.
My 2 cents: if you just need a quick way to detect an encoding, the command-line chardet tool is the way to go.
juniversalchardet is one of the best available APIs for detecting the encoding type. Please check out this link; you can go through the list of encoding types it supports.
Working example from the site:
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = args[0];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1) Construct an instance of UniversalDetector
        UniversalDetector detector = new UniversalDetector(null);

        // (2) Feed the detector chunks of data until it has seen enough
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }

        // (3) Notify the detector that no more data is coming
        detector.dataEnd();
        fis.close();

        // (4) Query the detected charset name (null if detection failed)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector so the instance can be reused
        detector.reset();
    }
}
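Once an encoding has been detected, converting the file to the customer's required encoding is a short step. A minimal sketch (the file path and charsets are placeholders):
Charset detected = Charset.forName("windows-1252"); // whatever the detector reported
Charset target = StandardCharsets.UTF_8;            // what the client config requires
Path file = Paths.get("delivery.sql");
String text = new String(Files.readAllBytes(file), detected);
Files.write(file, text.getBytes(target));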
Hope this helps!
I need to parse a file (actually a .pdf) into a String and then go back from that String to a file. In between, I'll apply some patches to the given string, but this is not important in this case.
I've developed the following JUnit test case:
String f1String=FileUtils.readFileToString(f1);
File temp=File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String);
assertTrue(FileUtils.contentEquals(f1, temp));
This test converts a file to a string and writes it back. However, the test fails.
I think it may be because of the encodings, but in FileUtils there is not much detailed info about this.
Anyone can help?
Thanks!
Added for further understanding:
Why do I need this?
I have very large PDFs on one machine that are replicated on another one. The first machine is in charge of creating those PDFs. Due to the low connectivity of the second machine and the large size of the PDFs, I don't want to sync the whole PDFs, but only the changes made.
To create patches and apply them, I'm using the Google library DiffMatchPatch. This library creates patches between two strings. So I need to load a PDF into a string, apply a generated patch, and write it back to a file.
A PDF is not a text file. Decoding (into Java characters) and re-encoding of binary files that are not encoded text is asymmetrical. For example, if the input bytestream is invalid for the current encoding, you can be assured that it won't re-encode correctly. In short - don't do that. Use readFileToByteArray and writeByteArrayToFile instead.
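A minimal sketch of the symmetric byte-based round trip with commons-io (the file names are placeholders):
byte[] data = FileUtils.readFileToByteArray(new File("in.pdf"));
// ... apply whatever byte-level transformation you need ...
FileUtils.writeByteArrayToFile(new File("out.pdf"), data);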
Just a few thoughts:
There might actually be some BOM (byte order mark) bytes in one of the files that either get stripped when reading or added during writing. Is there a difference in file size (if it is the BOM, the difference should be 2 or 3 bytes)?
The line breaks might not match, depending on which system the files were created on; i.e., one might have CR LF while the other only has LF or CR (1 byte difference per line break).
According to the JavaDoc, both methods should use the default encoding of the JVM, which should be the same for both operations. However, try and test with an explicitly set encoding (the JVM's default encoding can be queried using System.getProperty("file.encoding")).
Ed Staub's answer points out why my solution is not working, and he suggested using bytes instead of Strings. In my case I need a String, so the final working solution I've found is the following:
@Test
public void testFileRWAsArray() throws IOException {
    // Read the file as raw bytes and map each byte to a char via a plain cast
    String f1String = "";
    byte[] bytes = FileUtils.readFileToByteArray(f1);
    for (byte b : bytes) {
        f1String = f1String + ((char) b);
    }
    File temp = File.createTempFile("deleteme", "deleteme");
    // Map each char back to a byte with the inverse cast; the round trip is lossless
    byte[] newBytes = new byte[f1String.length()];
    for (int i = 0; i < f1String.length(); ++i) {
        char c = f1String.charAt(i);
        newBytes[i] = (byte) c;
    }
    FileUtils.writeByteArrayToFile(temp, newBytes);
    assertTrue(FileUtils.contentEquals(f1, temp));
}
By using a cast between byte and char, I get symmetry in the conversion.
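Note: an equivalent and more concise way to get the same byte-level symmetry (a suggestion, not part of the original answer) is to decode and re-encode with ISO-8859-1, which maps every byte value to a single char and back losslessly:
String f1String = new String(FileUtils.readFileToByteArray(f1), StandardCharsets.ISO_8859_1);
byte[] newBytes = f1String.getBytes(StandardCharsets.ISO_8859_1);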
Thank you all!
Try this code...
public static String fetchBase64binaryEncodedString(String path) {
    File inboundDoc = new File(path);
    byte[] pdfData;
    try {
        pdfData = FileUtils.readFileToByteArray(inboundDoc);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    byte[] encodedPdfData = Base64.encodeBase64(pdfData);
    String attachment = new String(encodedPdfData);
    return attachment;
}
// How to decode it
public void testConversionPDFtoBase64() throws IOException {
    String path = "C:/Documents and Settings/kantab/Desktop/GTR_SDR/MSDOC.pdf";
    File origFile = new File(path);
    String encodedString = CreditOneMLParserUtil.fetchBase64binaryEncodedString(path);
    // now decode it
    byte[] decodeData = Base64.decodeBase64(encodedString.getBytes());
    String decodedString = new String(decodeData);
    // or actually give the path to the PDF file
    File decodedfile = File.createTempFile("DECODED", ".pdf");
    FileUtils.writeByteArrayToFile(decodedfile, decodeData);
    Assert.assertTrue(FileUtils.contentEquals(origFile, decodedfile));
    // Frame frame = new Frame("PDF Viewer");
    // frame.setLayout(new BorderLayout());
}