How to write to a StyledDocument with a specific charset?

How to write to a StyledDocument with a specific charset? - java

for a NetBeans plugin I want to change the content of a file (which is opened in the NetBeans editor) with a specific String and a specific charset. In order to achieve that, I open the file (a DataObject) with an EditorCookie and then I change the content by inserting a different string to the StyledDocument of my data object.
However, I have a feeling that the file is always saved as UTF-8. Even if I write a file mark in the file. Am I doing something wrong?
This is my code:
...
EditorCookie cookie = dataObject.getLookup().lookup(EditorCookie.class);
String utf16be = new String("\uFEFFHello World!".getBytes(StandardCharsets.UTF_16BE));
NbDocument.runAtomic(cookie.getDocument(), () -> {
try {
StyledDocument document = cookie.openDocument();
document.remove(0, document.getLength());
document.insertString(0, utf16be, null);
cookie.saveDocument();
} catch (BadLocationException | IOException ex) {
Exceptions.printStackTrace(ex);
}
});
I have also tried this approach which doesn't work too:
...
EditorCookie cookie = dataObject.getLookup().lookup(EditorCookie.class);
NbDocument.runAtomic(cookie.getDocument(), () -> {
try {
StyledDocument doc = cookie.openDocument();
String utf16be = "\uFEFFHello World!";
InputStream is = new ByteArrayInputStream(utf16be.getBytes(StandardCharsets.UTF_16BE));
FileObject fileObject = dataObject.getPrimaryFile();
String mimePath = fileObject.getMIMEType();
Lookup lookup = MimeLookup.getLookup(MimePath.parse(mimePath));
EditorKit kit = lookup.lookup(EditorKit.class);
try {
kit.read(is, doc, doc.getLength());
} catch (IOException | BadLocationException ex) {
Exceptions.printStackTrace(ex);
} finally {
is.close();
}
cookie.saveDocument();
} catch (Exception ex) {
Exceptions.printStackTrace(ex);
}
});

Your problem is probably here:
String utf16be = new String("\uFEFFHello World!".getBytes(StandardCharsets.UTF_16BE));
This won't do what you think it does. This will convert your string to a byte array using the UTF-16 little endian encoding and then create a String from these bytes using the JRE's default encoding.
So, here's the catch:
A String has no encoding.
The fact that in Java this is a sequence of chars does not matter. Substitute 'char' for 'carrier pigeons', the net effect will be the same.
If you want to write a String to a byte stream with a given encoding, you need to specify the encoding you need on the Writer object you create. Similarly, if you want to read a byte stream into a String using a given encoding, it is the Reader which you need to configure to use the encoding you want.
But your StyledDocument object's method name is .insertString(); You should .insertString() your String object as is; don't transform it the way you do, since this is misguided, as explained above.

Related

UTF-16LE encoding and xerces2 Java

I went through a few posts, like FileReader reads the file as a character stream and can be treated as whitespace if the document is handed as a stream of characters where the answers say the input source is actually a char stream, not a byte stream.
However, the suggested solution from 1 does not seem to apply to UTF-16LE. Although I use this code:
try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
DOMParser parser = new org.apache.xerces.parsers.DOMParser();
parser.parse(new InputSource(is));
return parser.getDocument();
} catch (final SAXParseException saxEx) {
LOG.debug("Unable to open [{}}] as InputSource.", absolutePath, saxEx);
}
I still get org.xml.sax.SAXParseException: Content is not allowed in prolog..
I looked at Files.newInputStream, and it indeed uses a ChannelInputStream which will hand over bytes, not chars. I also tried to set the Encoding of the InputSource object, but with no luck.
I also checked that there are not extra chars (except the BOM) before the <?xml part.
I also want to mention that this code works just fine with UTF-8.
// Edit:
I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), same results.
// Edit 2:
Tried using a buffered reader. Same results:
Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'
Thanks in advance.

To get a bit farther some info gathering:
byte[] bytes = Files.readAllBytes(filename.toPath);
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
LOG.info("Has BOM and is evidently UTF_16LE");
xml = xml.substring(1);
}
if (!xml.contains("<?xml")) {
LOG.info("Has no XML declaration");
}
String declaredEncoding = xml.replaceFirst("<?xml[^>]*encoding=[\"']([^\"']+)[\"']", "$1");
if (declaredEncoding == xml) {
declaredEncoding = "UTF-8";
}
LOG.info("Declared as " + declaredEncoding);
try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
DOMParser parser = new org.apache.xerces.parsers.DOMParser();
parser.parse(new InputSource(is));
return parser.getDocument();
} catch (final SAXParseException saxEx) {
LOG.debug("Unable to open [{}}] as InputSource.", absolutePath, saxEx);
}

How to properly open a png file

I am trying to attach a png file. Currently when I sent the email, the attachment is 2x bigger than the file should be and an invalid png file. Here is the code I currently have:
import com.sendgrid.*;
Attachments attachments = new Attachments();
String filePath = "/Users/david/Desktop/screenshot5.png";
String data = "";
try {
data = new String(Files.readAllBytes(Paths.get(filePath)));
} catch (IOException e) {
}
byte[] encoded = Base64.encodeBase64(data.getBytes());
String encodedString = new String(encoded);
attachments.setContent(encodedString);
Perhaps I am encoding the data incorrectly? What would be the correct way to 'get' the data to attach it?

With respect, this is why Python presents a problem to modern developers. It abstracts away important concepts that you can't fully understand in interpreted languages.
First, and this is a relatively basic concept, but you can't convert arbitrary byte sequences to a string and hope it works out. The following line is your first problem:
data = new String(Files.readAllBytes(Paths.get(filePath)));
EDIT: It looks like the library you are using expects the file to be base64 encoded. I have no idea why. Try changing your code to this:
Attachments attachments = new Attachments();
String filePath = "/Users/david/Desktop/screenshot5.png";
try {
byte[] encoded = Base64.encodeBase64(Files.readAllBytes(Paths.get(filePath)));
String encodedString = new String(encoded);
attachments.setContent(encodedString);
} catch (IOException e) {
}
The only issue you were having is that you were trying to represent arbitrary bytes as a string.

Take a look at the Builder class in the repository here. Example:
FileInputStream fileContent = new FileInputStream(filePath);
Attachments.Builder builder = new Attachments.Builder(fileName, fileContent);
mail.addAttachments(builder.build());

Insert boolean in RTF file using java

I have no idea how can I insert boolean sign into RTF document from java programm. I think about √ or ✓ and –. I tried insert these signs to clear document and save it as *.rtf and then open it in Notepad++ but there is a lot of codes (~160 lines) and I can not understand what is it. Do you have any idea?

After a short search I found this:
Writing unicode to rtf file
So a final code version would be:
public void writeToFile() {
String strJapanese = "日本語✓";
try {
FileOutputStream fos = new FileOutputStream("test.rtf");
Writer out = new OutputStreamWriter(fos, "UTF8");
out.write(strJapanese);
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}

Please read about RTF
√ or ✓ and – are not available in every charset, so specify it. If yout output in UTF-8 (and i advise you to do so, check here on how to do this). You might need to encode the sign aswell, check Wikipedia

Special characters are not converted correctly from pdf to text

I am having a set of pdf files that contain central european characters such as č, Ď, Š and so on. I want to convert them to text and I have tried pdftotext and PDFBox through Apache Tika but always some of them are not converted correctly.
The strange thing is that the same character in the same text is correctly converted at some places and incorrectly at some others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like that:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter (new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\to\myfiles\"
try{
is = new FileInputStream(f);
ContentHandler contenthandler = new BodyContentHandler(10*1024*1024);
Metadata metadata = new Metadata();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(is, contenthandler, metadata, new ParseContext());
String outputString = contenthandler.toString();
outputString = outputString.replace("\n", "\r\n");
System.err.println("Writing now file "+newname);
print.write(outputString);
}catch (Exception e) {
e.printStackTrace();
}
finally {
if (is != null) is.close();
print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI, as well.

Well aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use OutputStreamWriter instead wrapping a FileOutputStream, and specify UTF-8 as an encoding (as it can encode all of Unicode, and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)

Read email contents using Apache Commons Email

I've different properties file as shown below:
abc_en.properties
abc_ch.properties
abc_de.properties
All of these contain HTML tags & some static contents along with some image urls.
I want to send email message using apache commons email & I'm able to compose the name of the template through Java using locale as well.
String name = abc_ch.properties;
Now, how do I read it to send it as a Html Msg parameter using Java?
HtmlEmail e = new HtmlEmail();
e.setHostName("my.mail.com");
...
e.setHtmlMsg(msg);
How do I get the msg param to get the contents from the file? Any efficient & nice solun?
Can any one provide sample java code?
Note: The properties file has dynamic entries for username & some other fields like Dear ,....How do I substitute those dynamically?
Thanks

I would assume that *.properties is a text file.
If so, then do a File read into a String
eg:
String name = getContents(new java.io.File("/path/file.properties");
public static String getContents(File aFile) {
StringBuffer contents = new StringBuffer();
BufferedReader input = null;
try {
InputStreamReader fr=new InputStreamReader(new FileInputStream(aFile), "UTF8");
input = new BufferedReader( fr );
String line = null;
while (( line = input.readLine()) != null){
contents.append(line);
contents.append(System.getProperty("line.separator"));
}
}
catch (FileNotFoundException ex) {
//ex.printStackTrace();
}
catch (IOException ex){
//ex.printStackTrace();
}
finally {
try {
if (input!= null) {
input.close();
}
}
catch (IOException ex) {
//ex.printStackTrace();
}
}
return contents.toString();
}
regards

Hi Mike,
Well, I kind of guess that you are trying to send mails in multiple languages by rendering the elements from different property files at runtime. Also, you said "locale". Are you using the concept of "Resource Bundles )"? Well, in that case before you send mails,
1)You need to understand the naming conventions for naming a property file, without which the java compiler will not be able to load the appropriate property file at run time.
For this read the first page on the Resource Bundles page.
2) Once your naming conventions is fine, you can load the appropriate prop file like this:
Locale yourLocale = new Locale("en", "US");
ResourceBundle rb = ResourceBundle.getBundle("resourceBundleFileName", yourLocale);
3) Resource Bundle property file is nothing but a (Key,Value) pairs. Hence you can retrieve the value of a key like this:
String dearString = rb.getString("Dear");
String emailBody= rb.getString("emailBody");
4) You can later use this values for setting the attributes in your commons-email api.
Hope you find this useful!

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to write to a StyledDocument with a specific charset? - java

Related

UTF-16LE encoding and xerces2 Java

How to properly open a png file

Insert boolean in RTF file using java

Special characters are not converted correctly from pdf to text

Read email contents using Apache Commons Email

Categories

Resources