Disable auto-escaping when creating an HTML file with PrintWriter - java

I don't know why this is so much harder than expected.
I'm trying to create an HTML file with Java, and it is not working. The code creates a file, but the contents are not what I inputted.
My simplified code is as follows:
File file = new File("text.html");
PrintWriter out = null;
try {
    out = new PrintWriter(file);
    out.write("<b>Hello World!</b>");
} catch (Exception e) { }
out.close();
Instead of the contents "<b>Hello World!</b>", the HTML file contains the escaped form "&lt;b&gt;Hello World!&lt;/b&gt;". When I open the file with TextWrangler, I see that Java has automatically escaped all my angle brackets into &lt; and &gt;, which breaks all the formatting and is NOT what I want.
How do I avoid this?
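PrintWriter.write(String) passes characters through unchanged and does no HTML escaping on its own. One way to check where the escaping comes from is to read the file back as raw text right after writing it. The following is a minimal sketch of that check, not a diagnosis of the original program; the class name RawHtmlWriteDemo, the helper writeAndReadBack, and the temp file are made up for illustration.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: verifies what PrintWriter actually stores on disk.
final class RawHtmlWriteDemo {

    // Writes the markup with PrintWriter and returns the raw file contents.
    static String writeAndReadBack(String markup) {
        try {
            Path file = Files.createTempFile("text", ".html");
            // try-with-resources closes (and flushes) the writer even on failure
            try (PrintWriter out = new PrintWriter(file.toFile(), "UTF-8")) {
                out.write(markup);
            }
            String stored = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.delete(file);
            return stored;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndReadBack("<b>Hello World!</b>"));
    }
}
```

If the text read back still has plain angle brackets, the entities are most likely introduced somewhere else, for example by an earlier step that produced the string or by the tool used to view the file.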

Related

Problem with Special characters in XML conversion

I don't know why my code does not convert special characters from an XML file, such as "<" and ">", to "&lt;" and "&gt;".
I saw that you need to use an escapeXml method, which I did. I also read the complete XML into a string with the FileUtils.readFileToString() method, and that part works fine.
Can someone help me out? What did I do wrong?
try {
    File file = new File("C:\\Users\\Desktop\\project\\src\\main\\test1.xml");
    s = FileUtils.readFileToString(file, "utf-8");
    StringEscapeUtils.escapeXml10(s);
} catch(Exception e) {
    e.printStackTrace();
}
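One thing that stands out in the snippet is that the return value of escapeXml10 is discarded. Java strings are immutable, so a method that escapes text returns a new String and leaves its argument untouched. The sketch below illustrates this with a hand-rolled stand-in; escapeBasic is a hypothetical helper that handles only &, < and >, and is not the Commons implementation.

```java
// Hypothetical demo class: shows why the result of an escape call must be assigned.
final class EscapeResultDemo {

    // Stand-in for StringEscapeUtils.escapeXml10 (assumption: only & < > are handled here).
    static String escapeBasic(String input) {
        return input.replace("&", "&amp;")   // ampersand first, so later entities are not re-escaped
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String s = "<a>1 & 2</a>";
        escapeBasic(s);              // result thrown away: s still holds the raw markup
        System.out.println(s);
        s = escapeBasic(s);          // result kept: s now holds the escaped text
        System.out.println(s);
    }
}
```

Applied to the code in the question, the same idea would be to assign the result, as in s = StringEscapeUtils.escapeXml10(s);.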

Android (java) XmlSerializer not encoding "Windows-1252"

I'm currently working on an Android Studio project (Java) and I need to export an XML file with a specific encoding, "Windows-1252".
I've been trying to do this for a few hours (EXAMPLE A). No matter which encoding I choose, the resulting file has the "correct" encoding in the first XML line ("... encoding='windows-1252'"), but:
the chars inside the XML file are "escaped" with "&#xxx;"
opening the file with Notepad++, it detects the "UTF-8" encoding (not the desired "Windows-1252")
<?xml version='1.0' encoding='windows-1252' ?><test><message>&#225;&#233;&#237;&#243;&#250;&#227;&#245;&#231;</message></test>
To make sure that the "streams" were correct, I created a new sample (EXAMPLE B) without the "XmlSerializer", and the resulting file was much better:
the chars inside the XML file are now correct (not escaped)
opening the file with Notepad++, it detects the "ANSI" encoding (not the desired "Windows-1252")
<?xml version='1.0' encoding='windows-1252' ?><test><message>áéíóúãõç</message></test>
private void doDebug01(){
    File dstPath = Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOWNLOADS);
    try {
        //EXAMPLE A
        File dstFile = new File(dstPath, "test.xml");
        FileOutputStream dstFOS = new FileOutputStream(dstFile);
        OutputStream dstOS= new BufferedOutputStream(dstFOS);
        OutputStreamWriter dstOSW = new OutputStreamWriter(dstOS, "windows-1252");
        XmlSerializer dstXml = Xml.newSerializer();
        dstXml.setOutput(dstOSW);
        dstXml.startDocument("windows-1252", null);
        dstXml.startTag(null,"test");
        dstXml.startTag(null,"message");
        dstXml.text("áéíóúãõç");
        dstXml.endTag(null,"message");
        dstXml.endTag(null,"test");
        dstXml.endDocument();
        dstXml.flush();
        dstOSW.close();
        //EXAMPLE B
        File dstFileB = new File(dstPath, "testb.xml");
        FileOutputStream dstFOSB = new FileOutputStream(dstFileB);
        OutputStream dstOSB= new BufferedOutputStream(dstFOSB);
        OutputStreamWriter dstOSWB = new OutputStreamWriter(dstOSB, "windows-1252");
        dstOSWB.write("<?xml version='1.0' encoding='windows-1252' ?>");
        dstOSWB.write("<test>");
        dstOSWB.write("<message>áéíóúãõç</message>");
        dstOSWB.write("</test>");
        dstOSWB.flush();
        dstOSWB.close();
    }
    catch (IOException e) {
        Log.e("DEBUG", e.getMessage());
        e.printStackTrace();
    }
}
So now I'm confused and kind of stuck with the results I got from EXAMPLE B, because I don't know whether my problem (A) lies in the "XmlSerializer" or in the "streams" parameters.
What am I missing in my code (A) to get an Xml with correct chars/file encoded in "Windows-1252" (or at least closest to EXAMPLE B)?
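One way to separate the "streams" question from the serializer question is to look at the bytes an OutputStreamWriter produces for a given charset name. The sketch below uses only the standard library and an in-memory sink; the class name Cp1252BytesDemo and the helper encode are made up for illustration. It may also be worth checking whether "ANSI" in Notepad++ simply refers to the system code page, which on many Western European Windows installations is Windows-1252; that is an assumption about the machine, not something the question confirms.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical demo class: inspects the bytes written for a given charset name.
final class Cp1252BytesDemo {

    // Encodes the text through an OutputStreamWriter, as in EXAMPLE B, and returns the bytes.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, charsetName)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the compiler's file encoding.
        String text = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00e3\u00f5\u00e7"; // the accented sample text
        System.out.println("windows-1252 bytes: " + encode(text, "windows-1252").length);
        System.out.println("UTF-8 bytes:        " + encode(text, "UTF-8").length);
    }
}
```

If the stream layer yields one byte per accented character, as EXAMPLE B suggests, the remaining difference in EXAMPLE A would have to come from how the serializer decides to escape non-ASCII characters.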

Read and Write file in java whilst keeping the special characters

After reading and writing the file, the bullet points get replaced with symbolic unreadable text "�". Here is the code:
String str = FileUtils.readFileToString(new File(sourcePath), "UTF-8");
nextTextFile.append(redactStrings(str, redactedStrings));
FileUtils.writeStringToFile(new File(targetPath), nextTextFile.toString());
Link to sample file
generated file with funny characters
I checked it out on Windows and if the source file is encoded in UTF-8, the following code will produce the desired output to the console and to a file, which is then encoded in UTF-8 as well, making use of java.nio:
public static void main(String[] args) {
    Path inPath = Paths.get(sourcePath);
    Path outPath = Paths.get(targetPath);
    try {
        List<String> lines = Files.readAllLines(inPath, StandardCharsets.UTF_8);
        lines.forEach(line -> {
            System.out.println(line);
        });
        Files.write(outPath, lines, StandardCharsets.UTF_8);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Please note that the source file has to be encoded in UTF-8, otherwise this may throw an IOException stating something like Input length = 1. Play around with the StandardCharsets if it does not meet your requirements or make sure the encoding of the source file is UTF-8.
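The failure mode mentioned above can be reproduced in isolation. The sketch below writes a bullet character as a single windows-1252 byte and then reads the file back with different charsets; the class name CharsetMismatchDemo, the helper readBulletFileAs, and the sample text are assumptions made for illustration, not part of the original files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical demo class: reads a windows-1252 file with a matching and a mismatched charset.
final class CharsetMismatchDemo {

    // Writes one bullet line as windows-1252 bytes, then reads it back with readCharset.
    static String readBulletFileAs(Charset readCharset) {
        try {
            Path file = Files.createTempFile("bullets", ".txt");
            try {
                // U+2022 (bullet) becomes a single byte in windows-1252
                Files.write(file, "\u2022 item".getBytes(Charset.forName("windows-1252")));
                List<String> lines = Files.readAllLines(file, readCharset);
                return lines.get(0);
            } catch (MalformedInputException e) {
                // Files.readAllLines reports undecodable bytes instead of replacing them
                return "decode failed: " + e.getMessage();
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readBulletFileAs(StandardCharsets.UTF_8));
        System.out.println(readBulletFileAs(Charset.forName("windows-1252")));
    }
}
```

Reading with the charset the file was actually written in returns the line intact, which is why it helps to confirm the source file's real encoding before choosing a StandardCharsets constant.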

Insert boolean in RTF file using java

I have no idea how to insert a boolean sign into an RTF document from a Java program. I am thinking of √ or ✓ and –. I tried inserting these signs into an empty document, saving it as *.rtf and opening it in Notepad++, but it contains a lot of code (~160 lines) and I cannot understand what it is. Do you have any idea?
After a short search I found this:
Writing unicode to rtf file
So a final code version would be:
public void writeToFile() {
    String strJapanese = "日本語✓";
    try {
        FileOutputStream fos = new FileOutputStream("test.rtf");
        Writer out = new OutputStreamWriter(fos, "UTF8");
        out.write(strJapanese);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Please read about RTF
√ or ✓ and – are not available in every charset, so specify the charset explicitly. If you output in UTF-8 (and I advise you to do so; check here on how to do this), you might need to encode the sign as well; check Wikipedia.
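As an illustration of "encoding the sign", RTF has a \uN control word that carries a character as a signed 16-bit decimal number followed by a fallback character. The sketch below builds such escapes by hand; the class name RtfUnicodeDemo and the helpers escape and document are hypothetical, and it assumes the RTF reader honors \uN with a one-character fallback (\uc1). It is a minimal sketch and does not handle line breaks or other RTF formatting.

```java
// Hypothetical demo class: escapes non-ASCII characters as RTF \uN control words.
final class RtfUnicodeDemo {

    // Escapes RTF special characters and turns every non-ASCII UTF-16 unit into \uN?.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);                    // RTF control characters
            } else if (c < 128) {
                sb.append(c);                                 // plain ASCII passes through
            } else {
                // N is a signed 16-bit value; '?' is the fallback for readers without \u support
                sb.append("\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    // Wraps the escaped body in a minimal RTF document header.
    static String document(String body) {
        return "{\\rtf1\\ansi\\uc1 " + escape(body) + "}";
    }

    public static void main(String[] args) {
        // U+2713 is the check mark, U+2013 the en dash
        System.out.println(document("done: \u2713, open: \u2013"));
    }
}
```

Because the resulting text is pure ASCII, it can be written to the .rtf file with any ASCII-compatible encoding.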

Special characters are not converted correctly from pdf to text

I have a set of PDF files that contain Central European characters such as č, Ď, Š and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some of the characters are always converted incorrectly.
The strange thing is that the same character in the same text is correctly converted at some places and incorrectly at some others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter (new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
try{
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10*1024*1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file "+newname);
    print.write(outputString);
}catch (Exception e) {
    e.printStackTrace();
}
finally {
    if (is != null) is.close();
    print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI, as well.
Well aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter instead, wrapping a FileOutputStream, and specify UTF-8 as the encoding (as it can encode all of Unicode, and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
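A minimal sketch of that suggestion, assuming the extracted text is already available as a String: the writer wraps a FileOutputStream with an explicit UTF-8 charset, and try-with-resources takes the place of the finally block. The class name Utf8WriteDemo and the helpers writeUtf8 and roundTrip are made up for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: writes text with an explicit encoding instead of the platform default.
final class Utf8WriteDemo {

    // The "writing" part, kept separate from whatever produced the text.
    static void writeUtf8(Path target, String text) {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(target.toFile()), StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the text to a temp file and decodes it again as UTF-8.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("converted", ".txt");
            writeUtf8(file, text);
            String stored = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.delete(file);
            return stored;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+010D, U+010E, U+0160: the Central European letters from the question
        System.out.println(roundTrip("\u010d \u010e \u0160"));
    }
}
```

This only removes the platform default encoding as a variable; characters that are already wrong in the string returned by the parser would still be wrong in the file.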
