I'm looking to add a custom metadata tag to any type of file using functionality from java.nio.file.Files. I have been able to read metadata correctly, but am having issues whenever I try to set metadata.
I've tried to set a custom metadata element with a plain string using Files.setAttribute with the following:
Path photo = Paths.get("C:\\Users\\some\\picture\\path\\2634.jpeg");
try {
    BasicFileAttributes attrs = Files.readAttributes(photo, BasicFileAttributes.class);
    Files.setAttribute(photo, "user:tags", "test");
    String attribute = Files.getAttribute(photo, "user:tags").toString();
    System.out.println(attribute);
} catch (IOException ioex) {
    ioex.printStackTrace();
}
but end up with the following error:
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.nio.ByteBuffer
If I instead wrap that string's bytes in a ByteBuffer like so
Path photo = Paths.get("C:\\Users\\some\\picture\\path\\2634.jpeg");
try {
    BasicFileAttributes attrs = Files.readAttributes(photo, BasicFileAttributes.class);
    Files.setAttribute(photo, "user:tags", ByteBuffer.wrap("test".getBytes("UTF-8")));
    String attribute = Files.getAttribute(photo, "user:tags").toString();
    System.out.println(attribute);
} catch (IOException ioex) {
    ioex.printStackTrace();
}
instead of outputting the text 'test', it outputs the strange character string '[B@14e3f41'.
What is the proper way to convert a String to a ByteBuffer and have it be convertible back into a String, and is there a more customizable way to modify metadata on a file using Java?
User-defined attributes, that is, any attribute defined by UserDefinedFileAttributeView (provided that your FileSystem supports them!), are readable/writable from Java as byte arrays; if a given attribute contains text content, it is then up to the application to decide which encoding to use for the string in question.
Now, you are using the .{get,set}Attribute() methods, which means that you have two options to write user attributes:
either using a ByteBuffer like you did; or
using a plain byte array.
What you will read out of it, however, is always a byte array.
From the javadoc link above:
Where dynamic access to file attributes is required, the getAttribute method may be used to read the attribute value. The attribute value is returned as a byte array (byte[]). The setAttribute method may be used to write the value of a user-defined attribute from a buffer (as if by invoking the write method), or byte array (byte[]).
So, in your case:
in order to write the attribute, obtain a byte array with the requested encoding from your string:
final Charset utf8 = StandardCharsets.UTF_8;
final String myAttrValue = "Mémé dans les orties";
final byte[] userAttributeValue = myAttrValue.getBytes(utf8);
Files.setAttribute(photo, "user:tags", userAttributeValue);
in order to read the attribute, you'll need to cast the result of .getAttribute() to a byte array, and then obtain a string out of it, again using the correct encoding:
final Charset utf8 = StandardCharsets.UTF_8;
final byte[] userAttributeValue = (byte[]) Files.getAttribute(photo, "user:tags");
final String myAttrValue = new String(userAttributeValue, utf8);
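Putting both directions together, a minimal round-trip sketch (the path is assumed; the underlying file store must support user-defined attributes, e.g. NTFS, or ext4 with user xattrs enabled):

final Path photo = Paths.get("2634.jpeg"); // assumed path
final Charset utf8 = StandardCharsets.UTF_8;
// write: a plain byte array goes in...
Files.setAttribute(photo, "user:tags", "test".getBytes(utf8));
// read: ...and a byte array comes out, decoded with the same charset
final byte[] raw = (byte[]) Files.getAttribute(photo, "user:tags");
System.out.println(new String(raw, utf8)); // prints "test"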
A peek into the other solution, just in case...
As already mentioned, what you want to deal with is a UserDefinedFileAttributeView. The Files class allows you to obtain any FileAttributeView implementation using this method:
final UserDefinedFileAttributeView view
= Files.getFileAttributeView(photo, UserDefinedFileAttributeView.class);
Now, once you have this view at your disposal, you may read from, or write to, it.
For instance, here is how you would read your particular attribute; note that here we only use the attribute name, since the view (with name "user") is already there:
final Charset utf8 = StandardCharsets.UTF_8;
final int attrSize = view.size("tags");
final ByteBuffer buf = ByteBuffer.allocate(attrSize);
view.read("tags", buf);
return new String(buf.array(), utf8);
In order to write, you'll need to wrap the byte array into a ByteBuffer:
final Charset utf8 = StandardCharsets.UTF_8;
final byte[] array = tagValue.getBytes(utf8);
final ByteBuffer buf = ByteBuffer.wrap(array);
view.write("tags", buf);
Like I said, it gives you more control, but is more involved.
Final note: as the name pretty much dictates, user defined attributes are user defined; a given attribute for this view may, or may not, exist. It is your responsibility to correctly handle errors if an attribute does not exist etc; the JDK offers no such thing as NoSuchAttributeException for this kind of scenario.
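For instance, here is a sketch of a defensive read, reusing the view obtained above (view.list() returns the names of all user-defined attributes currently present on the file):

if (view.list().contains("tags")) {
    final ByteBuffer buf = ByteBuffer.allocate(view.size("tags"));
    view.read("tags", buf);
    System.out.println(new String(buf.array(), StandardCharsets.UTF_8));
} else {
    // attribute absent: fall back to a default, or report the problem
}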
Related
As per this 3rd answer, I can write a file like this:
Files.write(Paths.get("file6.txt"), lines, utf8,
StandardOpenOption.CREATE, StandardOpenOption.APPEND);
However, when I try it in my code I get this error:
The method write(Path, Iterable, Charset,
OpenOption...) in the type Files is not applicable for the arguments
(Path, byte[], Charset, StandardOpenOption)
This is my code:
File dir = new File(myDirectoryPath);
File[] directoryListing = dir.listFiles();
if (directoryListing != null) {
    File newScript = new File(newPath + "//newScript.pbd");
    if (!newScript.exists()) {
        newScript.createNewFile();
    }
    for (File child : directoryListing) {
        if (!child.isDirectory()) {
            byte[] content = null;
            Charset utf8 = StandardCharsets.UTF_8;
            content = readFileContent(child);
            try {
                Files.write(Paths.get(newPath + "\\newScript.pbd"), content, utf8,
                        StandardOpenOption.APPEND); // <== error here on this line
            } catch (Exception e) {
                System.out.println("COULD NOT LOG!! " + e);
            }
        }
    }
}
Note: if I change my code like this (removing utf8), it works and writes into the file.
Files.write(Paths.get(newPath + "\\newScript.pbd"), content,
StandardOpenOption.APPEND);
Explanation
There are 3 overloads of the Files#write method (see the documentation):
Takes Path, byte[], OpenOption... (no charset)
Takes Path, Iterable<? extends CharSequence>, OpenOption... (like List<String>, no charset, uses UTF-8)
Takes Path, Iterable<? extends CharSequence>, Charset, OpenOption... (has charset)
For your call (Path, byte[], Charset, OpenOption...), no matching version exists. Thus, it does not compile.
It does not match the first or second overload, since they do not accept a Charset; and it does not match the third overload, since an array is not an Iterable (a class like ArrayList is), nor does byte extend CharSequence (String does).
In the error message you can see what Java computed as the closest match to your call; unfortunately (as explained) it is not applicable for your arguments.
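To make the matching concrete, here are calls that do compile, one per overload (assuming path is a Path, content a byte[], and lines a List<String>):

Files.write(path, content);                        // 1st overload: byte[], no charset
Files.write(path, lines);                          // 2nd overload: Iterable, default UTF-8
Files.write(path, lines, StandardCharsets.UTF_8);  // 3rd overload: Iterable + Charset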
Solution
You most likely intended to go for the first overload:
Files.write(Paths.get("file6.txt"), lines,
StandardOpenOption.CREATE, StandardOpenOption.APPEND);
I.e. no charset.
Notes
A charset makes no sense in your context. The sole purpose of a charset is to convert correctly between String and binary byte[]. But your data is already binary, so any charset conversion takes place before that; in your case, that would be the reading phase in readFileContent.
Also note that the text-based methods in Files use UTF-8 by default already, so there is no need to specify it explicitly anyway.
When specifying OpenOptions, you may also want to specify whether it is StandardOpenOption.READ or StandardOpenOption.WRITE mode. The Files#write method by default uses:
WRITE
CREATE
TRUNCATE_EXISTING
So you may want to call it with the following options (a spelled-out call follows the list):
WRITE
CREATE
APPEND
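For example, with the arguments from your snippet (APPEND alone already implies opening for writing):

Files.write(Paths.get(newPath + "\\newScript.pbd"), content,
        StandardOpenOption.WRITE, StandardOpenOption.CREATE,
        StandardOpenOption.APPEND);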
Examples
Here are some snippets of how to fully read and write in text and binary, in UTF-8:
// Text mode
List<String> lines = Files.readAllLines(inputPath);
// do something with lines
Files.write(outputPath, lines);
// Binary mode
byte[] content = Files.readAllBytes(inputPath);
// do something with content
Files.write(outputPath, content);
In the 3rd answer you mentioned, the content of the file is iterable, namely a List<String>. You cannot use this method with a byte[]. Make your readFileContent() method return something iterable, such as a List<String> (each element being a line of your file), as sketched below.
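A minimal sketch of such a readFileContent, assuming line-oriented files (the method name and parameter are taken from your code):

static List<String> readFileContent(File child) throws IOException {
    // each element is one line of the file, so the Iterable+Charset overload applies
    return Files.readAllLines(child.toPath(), StandardCharsets.UTF_8);
}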
I do some acrofield manipulation for text fields which have parent fields. This works so far, but the form also contains some checkboxes, which are not supposed to be changed. But when I store the manipulated PDF to disk and inspect the value of the checkbox, I can see that the value of cb_a.0 has been changed from ÄÖÜ?ß to ?????
My further processing fails because of this unintended change; any idea how to prevent that?
My test case:
@Test
public void changeBoxedFieldsToOne() throws IOException {
    File encodingPdfFile = new File(classLoader.getResource("./prefill/TestFormEncoding.pdf").getFile());
    byte[] encodingPdfByte = Files.readAllBytes(encodingPdfFile.toPath());
    PdfAcrofieldManipulator pdfMani = new PdfAcrofieldManipulator(encodingPdfByte);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() > 0);
    pdfMani.changeBoxedFieldsToOne();
    byte[] changedPdf = pdfMani.savePdf();
    Files.write(Paths.get("./build/changeBoxedFieldsToOne.pdf"), changedPdf);
    pdfMani = new PdfAcrofieldManipulator(changedPdf);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() == 0);
}
public void changeBoxedFieldsToOne() {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    List<PDNonTerminalField> textFieldWithMoreThan2Childrens = getTextFieldsWithMoreThan2Children();
    for (PDField field : textFieldWithMoreThan2Childrens) {
        int amountOfChilds = ((PDNonTerminalField) field).getChildren().size();
        String currentFieldName = field.getPartialName();
        LOG.info("merging fields of fieldname {0} to one field", currentFieldName);
        PDField firstChild = getChildWithPartialName((PDNonTerminalField) field, "0");
        if (firstChild == null) {
            LOG.debug("found field which has a dot but starts not with 0, skipping this field");
            continue;
        }
        PDField lastChild = getChildWithPartialName((PDNonTerminalField) field, Integer.toString(amountOfChilds - 1));
        PDPage pageWhichContainsField = firstChild.getWidgets().get(0).getPage();
        try {
            removeField(pdDocument, currentFieldName);
        } catch (IOException e) {
            LOG.error("Error while removing field {0}", currentFieldName, e);
        }
        PDField newField = creatNewField(acroForm, field, firstChild, lastChild, pageWhichContainsField);
        acroForm.getFields().add(newField);
        PDAnnotationWidget newFieldWidget = createWidgetForField(newField, pageWhichContainsField, firstChild, lastChild);
        try {
            pageWhichContainsField.getAnnotations().add(newFieldWidget);
        } catch (IOException e) {
            LOG.error("error while adding new field to page");
        }
    }
}
public byte[] savePdf() throws IOException {
    try (final ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        //pdDocument.saveIncremental(out);
        pdDocument.save(out);
        pdDocument.close();
        return out.toByteArray();
    }
}
I am using PDFBox 2.0.8
Here is the source PDF: https://ufile.io/gr01f or here: https://www.file-upload.net/download-12928052/TestFormEncoding.pdf.html
Here is the output: https://ufile.io/k8cr3 or here: https://www.file-upload.net/download-12928049/changeBoxedFieldsToOne.pdf.html
This indeed is a bug in PDFBox: PDFBox cannot properly handle PDF Name objects containing bytes with values outside the US_ASCII range (in particular outside the range 0..127, and your umlauts are outside).
The first error in PDF Name handling is that PDFBox internally represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. This is wrong; according to the PDF specification, a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). [...]
Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a PDF processor. However, occasionally the need arises to treat a name object as text, such as one that represents a font name [...], a colourant name in a Separation or DeviceN colour space, or a structure type [...]
In such situations, the sequence of bytes making up the name object should be interpreted according to UTF-8, a variable-length byte-encoded representation.
Thus, it generally does not make sense to treat a name as anything else than a byte sequence. Only names used in certain contexts should be meaningful as UTF-8 encoded strings.
Furthermore, a mixed UTF-8 / CP-1252 decoding strategy, i.e. one that first tries to decode using UTF-8 and in case of failure tries again with CP-1252, can create the same string representation for different name entities, so it can indeed falsify documents by making unequal names equal.
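A hypothetical illustration of that ambiguity (raw name bytes, not PDFBox API calls):

byte[] nameA = { (byte) 0xC3, (byte) 0xA9 }; // valid UTF-8 for "é"
byte[] nameB = { (byte) 0xE9 };              // invalid as UTF-8; CP-1252 for "é"
// a "try UTF-8, fall back to CP-1252" strategy decodes BOTH byte sequences
// to the string "é", so two distinct PDF Name objects collapse into one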
This is not the problem in your case, though; the names you used can be interpreted.
The second error, though, is that while serializing the PDF, PDFBox properly encodes only the US_ASCII characters in the strings representing names; all others are replaced by '?':
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        [...]
    }
}
(from org.apache.pdfbox.cos.COSName.writePDF(OutputStream))
This is where your checkbox values (which internally are represented by PDF Name objects) get damaged beyond repair...
A simpler example showing the problem is this:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
document.save(new File(RESULT_FOLDER, "non-ascii-name.pdf"));
document.close();
In the result the catalog with the custom entry looks like this:
1 0 obj
<<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
/#3F#3F#3F#3F <E4F6FCDF>
>>
In the name key all characters are replaced by '?' in hex encoded form (#3F) while in the string value the characters are appropriately encoded.
After a bit of searching I stumbled over an answer on this topic I gave almost two years ago. Back then the PDF Name object bytes were always interpreted as UTF-8 encoded which led to issues in that question.
As a consequence the issue PDFBOX-3347 was created. To resolve it, the mixed UTF-8 / CP-1252 decoding strategy was introduced. As expressed above, though, I'm not a fan of that strategy.
In that Stack Overflow answer I also already discussed the problems related to the use of US_ASCII during PDF serialization, but that aspect has not yet been addressed at all.
Another related issue is PDFBOX-3519 but its resolution also was reduced to trying to fix the parsing of PDF Names, ignoring the serialization of it.
Yet another related issue is PDFBOX-2836.
I have a character set conversion issue:
I am updating Japanese Kanji characters in DB2 on an iSeries system with the following conversion method:
AS400 sys = new AS400("<host>","username","password");
CharConverter charConv = new CharConverter(5035, sys);
byte[] b = charConv.stringToByteArray(5035, sys, "試験");
AS400Text textConverter = new AS400Text(b.length, 65535,sys);
While retrieving, I use the following code to convert & display:
CharConverter charConv = new CharConverter(5035, sys);
byte[] bytes = charConv.stringToByteArray(5035, sys, dbRemarks);
String s = new String(bytes);
System.out.println("Remarks after conversion to AS400Text :"+s);
But the system displays garbled characters. Can anybody help me decode Japanese characters from binary storage?
Well I don't know anything about CharConverter or AS400Text, but code like this is almost always a mistake:
String s = new String(bytes);
That uses the platform default encoding to convert the binary data to text.
Usually storage and retrieval should go through opposite processes - so while you've started with a string and then converted it to bytes, and converted that to an AS400Text object when storing it, I'd expect you to start with an AS400Text object, convert that to a byte array, and then convert that to a String using CharConverter when fetching. The fact that you're calling stringToByteArray in both cases suggests there's something amiss.
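For instance, symmetric handling with the classes from the question might look like this (CCSID 5035 is taken from the question's code; whether it matches your column is an assumption):

CharConverter conv = new CharConverter(5035, sys);
// storing: String -> CCSID 5035 bytes
byte[] stored = conv.stringToByteArray(5035, sys, "試験");
// fetching: CCSID 5035 bytes -> String (the opposite process)
String fetched = conv.byteArrayToString(5035, sys, stored);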
(It would also help if you'd tell us what dbRemarks is, and how you've fetched it.)
I do note that having checked some documentation for AS400Text, I've seen this:
Due to recent changes in the behavior of the character conversion routines, this system object is no longer necessary, except when the AS400Text object is to be passed as a parameter on a Toolbox Proxy connection.
There's similar documentation for CharConverter. Are you sure you actually need to go through this at all? Have you tried just storing the string directly and retrieving it directly, without going through intermediate steps?
Thank you Jon Skeet!
Yes, I made a mistake by not encoding the string at declaration.
My issue is to get the data stored in DB2, convert it into Japanese, and provide it for editing in a web page. I am getting dbRemarks from the result set. I have missed another thing in my post:
While inserting, I am converting to text like this:
String text = (String) textConverter.toObject(b);
PreparedStatement prepareStatementUpdate = connection.prepareStatement(updateSql);
prepareStatementUpdate.setString(1, text);
int count = prepareStatementUpdate.executeUpdate();
I am able to retrieve and display clearly with this code:
String selectSQL = "SELECT remarks FROM empTable WHERE emp_id = ? AND dep_id = ? AND join_date = '2013-11-15'";
prepareStatement = connection.prepareStatement(selectSQL);
prepareStatement.setInt(1, 1);
prepareStatement.setInt(2, 1);
ResultSet resultSet = prepareStatement.executeQuery();
while (resultSet.next()) {
    byte[] bytedata = resultSet.getBytes("remarks");
    AS400Text textConverter2 = new AS400Text(bytedata.length, 5035, sys);
    String javaText = (String) textConverter2.toObject(bytedata);
    System.out.println("Remarks after conversion to AS400Text :" + javaText);
}
It is working fine with JDBC, but to work with JPA I need to convert it to a String for editing in a web page or storing in a table. So I have tried this way, but could not succeed:
String remarks = resultSet.getString( "remarks" );
byte[] bytedata = remarks.getBytes();
AS400Text textConverter2 = new AS400Text(bytedata.length, 5035,sys);
String javaText = (String) textConverter2.toObject(bytedata);
System.out.println("Remarks after conversion to AS400Text :"+javaText);
Thanks a lot Jon and Buck Calabro !
With your clues, I have succeeded with the following approach:
String remarks = new String(resultSet.getBytes("remarks"),"SJIS");
byte[] byteData = remarks.getBytes("SJIS");
CharConverter charConv = new CharConverter(5035, sys);
String convertedStr = charConv.byteArrayToString(5035, sys, byteData);
I am able to convert from the string. I am planning to implement the same with JPA, and have started coding.
I need to parse a Java File (actually a .pdf) into a String and then go back to a File. Between those processes I'll apply some patches to the given string, but this is not important in this case.
I've developed the following JUnit test case:
String f1String=FileUtils.readFileToString(f1);
File temp=File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String);
assertTrue(FileUtils.contentEquals(f1, temp));
This test converts a file to a string and writes it back. However, the test is failing.
I think it may be because of the encodings, but FileUtils does not give much detailed info about this.
Anyone can help?
Thanks!
Added for further understanding:
Why do I need this?
I have very large PDFs on one machine that are replicated on another one. The first one is in charge of creating those PDFs. Due to the low connectivity of the second machine and the big size of the PDFs, I don't want to sync the whole PDFs, but only the changes made.
To create and apply patches, I'm using the Google library DiffMatchPatch. This library creates patches between two strings. So I need to load a PDF into a string, apply a generated patch, and put it back into a file.
A PDF is not a text file. Decoding (into Java characters) and re-encoding of binary files that are not encoded text is asymmetrical. For example, if the input bytestream is invalid for the current encoding, you can be assured that it won't re-encode correctly. In short - don't do that. Use readFileToByteArray and writeByteArrayToFile instead.
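A minimal sketch of that symmetric, binary-safe round trip (f1 as in the question; Commons IO):

byte[] data = FileUtils.readFileToByteArray(f1);
// ... apply binary-safe patches to data here ...
File temp = File.createTempFile("deleteme", "deleteme");
FileUtils.writeByteArrayToFile(temp, data);
assertTrue(FileUtils.contentEquals(f1, temp));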
Just a few thoughts:
There might actually be some BOM (byte order mark) bytes in one of the files that either get stripped when reading or added during writing. Is there a difference in the file size (if it is the BOM, the difference should be 2 or 3 bytes)?
The line breaks might not match, depending which system the files are created on, i.e. one might have CR LF while the other only has LF or CR. (1 byte difference per line break)
According to the JavaDoc, both methods should use the default encoding of the JVM, which should be the same for both operations. However, try and test with an explicitly set encoding (the JVM's default encoding can be queried using System.getProperty("file.encoding")).
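If you want to rule the JVM default out, Commons IO also has overloads that take an explicit charset (assuming a reasonably recent Commons IO version):

String f1String = FileUtils.readFileToString(f1, StandardCharsets.UTF_8);
File temp = File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String, StandardCharsets.UTF_8);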
Ed Staub's answer points out why my solution is not working, and he suggested using bytes instead of Strings. In my case I need a String, so the final working solution I've found is the following:
@Test
public void testFileRWAsArray() throws IOException {
    String f1String = "";
    byte[] bytes = FileUtils.readFileToByteArray(f1);
    for (byte b : bytes) {
        f1String = f1String + ((char) b);
    }
    File temp = File.createTempFile("deleteme", "deleteme");
    byte[] newBytes = new byte[f1String.length()];
    for (int i = 0; i < f1String.length(); ++i) {
        char c = f1String.charAt(i);
        newBytes[i] = (byte) c;
    }
    FileUtils.writeByteArrayToFile(temp, newBytes);
    assertTrue(FileUtils.contentEquals(f1, temp));
}
By using a cast between byte and char, I get symmetry in the conversion.
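For what it's worth, the same lossless symmetry can be had without the hand-written loops by decoding and re-encoding with ISO-8859-1, a charset that maps all 256 byte values one-to-one (the resulting chars differ from the cast version for bytes above 0x7F, but the round trip is equally exact):

// ISO-8859-1 round-trips every byte value exactly, so no data is lost
String f1String = new String(bytes, StandardCharsets.ISO_8859_1);
byte[] newBytes = f1String.getBytes(StandardCharsets.ISO_8859_1);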
Thank you all!
Try this code...
public static String fetchBase64binaryEncodedString(String path) {
    File inboundDoc = new File(path);
    byte[] pdfData;
    try {
        pdfData = FileUtils.readFileToByteArray(inboundDoc);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    byte[] encodedPdfData = Base64.encodeBase64(pdfData);
    String attachment = new String(encodedPdfData);
    return attachment;
}
//How to decode it
public void testConversionPDFtoBase64() throws IOException {
    String path = "C:/Documents and Settings/kantab/Desktop/GTR_SDR/MSDOC.pdf";
    File origFile = new File(path);
    String encodedString = CreditOneMLParserUtil.fetchBase64binaryEncodedString(path);
    // now decode it
    byte[] decodeData = Base64.decodeBase64(encodedString.getBytes());
    String decodedString = new String(decodeData);
    // or actually give the path to the pdf file
    File decodedfile = File.createTempFile("DECODED", ".pdf");
    FileUtils.writeByteArrayToFile(decodedfile, decodeData);
    Assert.assertTrue(FileUtils.contentEquals(origFile, decodedfile));
    // Frame frame = new Frame("PDF Viewer");
    // frame.setLayout(new BorderLayout());
}
How can I get the file type extension from a byte[] (Blob)? I'm reading files from a DB into a byte[], but I don't know how to automatically detect the file extension.
Blob blob = rs.getBlob(1);
byte[] bdata = blob.getBytes(1, (int) blob.length());
You mean you want to get the extension of the file for which the blob stores the content? So if the BLOB stores the content of a JPEG file, you want "jpg"?
That's, generally speaking, not possible. You can make a fairly good guess by using some heuristic, such as Apache Tika's content detection.
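A minimal sketch with Tika's facade class (from tika-core), applied to the byte[] from the question:

Tika tika = new Tika();
String mimeType = tika.detect(bdata); // e.g. "image/jpeg"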
A better solution, however, would be to store the mime type (or original file extension) in a separate column, such as a VARCHAR.
It's not perfect, but the Java Mime Magic library may be able to infer the file extension:
Magic.getMagicMatch(bdata).getExtension();
Try ByteArrayDataSource (http://download.oracle.com/javaee/5/api/javax/mail/util/ByteArrayDataSource.html); you will find a getContentType() method there, which should help, but I've never tried it personally.
if (currentImageType == null) {
    ByteArrayInputStream is = new ByteArrayInputStream(image);
    String mimeType = URLConnection.guessContentTypeFromStream(is);
    if (mimeType == null) {
        AutoDetectParser parser = new AutoDetectParser();
        Detector detector = parser.getDetector();
        Metadata md = new Metadata();
        mimeType = detector.detect(is, md).toString();
        if (mimeType.contains("pdf")) {
            mimeType = "pdf";
        } else if (mimeType.contains("tif") || mimeType.contains("tiff")) {
            mimeType = "tif";
        }
    }
    if (mimeType.contains("png")) {
        mimeType = "png";
    } else if (mimeType.contains("jpg") || mimeType.contains("jpeg")) {
        mimeType = "jpg";
    } else if (mimeType.contains("pdf")) {
        mimeType = "pdf";
    } else if (mimeType.contains("tif") || mimeType.contains("tiff")) {
        mimeType = "tif";
    }
    currentImageType = ImageType.fromValue(mimeType);
}
An alternative to using a separate column is using magic numbers. Here is a sketch:

static String getFileExtn(byte[] blob) {
    // PNG files always begin with these four magic bytes
    byte[] pngMagNum = { (byte) 0x89, 0x50, 0x4E, 0x47 };
    if (blob.length >= 4 && Arrays.equals(Arrays.copyOfRange(blob, 0, 4), pngMagNum)) {
        return ".png";
    }
    // More checks...
    return null;
}
You would have to do this for every file type you support. For some obscure file types you might have to find the magic number yourself via a hex editor (the magic number is always the first few bytes of the file). The benefit of using the magic number is that you get the actual file type, not just what the user decided to name the file.
There is a decent method in the JDK's URLConnection class; please refer to the following answer:
Getting A File's Mime Type In Java
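For reference, a minimal sketch of that JDK method applied to the byte[] from the question (it needs a mark-supporting stream, which ByteArrayInputStream is, and may return null for unrecognized content):

String mimeType = URLConnection.guessContentTypeFromStream(
        new ByteArrayInputStream(bdata));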