JSoup seems to ignore character codes?


I'm building a small CMS-like application in Java, that takes a .txt file with shirt names/descriptions and loads the names/descriptions into an ArrayList of customShirts (small class I made). Then, it iterates through the ArrayList, and uses JSoup to parse a template (template.html) and insert the unique details of the shirt into the HTML. Finally, it pumps out each shirt into its own HTML file in an output folder.
When the descriptions are loaded into the ArrayList of customShirts, I replace all special characters with the appropriate character codes so they can be properly displayed (for example, replacing apostrophes with &rsquo;). The issue is, I've noticed that JSoup seems to automatically turn the character codes into the actual character, which is an issue, since I need the output to be displayable (which requires character codes). Is there anything I can do to fix this? I've looked at other workarounds, like at: Jsoup unescapes special characters, but they seem to require parsing the file before insertion with replaceAll, and I insert the character-code sensitive text with JSoup, which doesn't seem to make this an option.
Below is the code for the HTML generator I made:
public void generateShirtHTML(){
for(int i = 0; i < arrShirts.size(); i++){
File input = new File("html/template/template.html");
Document doc = null;
try {
doc = Jsoup.parse(input, "UTF-8", "");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Element title = doc.select("title").first();
title.append(arrShirts.get(i).nameToCapitalized());
Element headingTitle = doc.select("h1#headingTitle").first();
headingTitle.html(arrShirts.get(i).nameToCapitalized());
Element shirtDisplay = doc.select("p#alt1").first();
shirtDisplay.html(arrShirts.get(i).name);
Element descriptionBox = doc.select("div#descriptionbox p").first();
descriptionBox.html(arrShirts.get(i).desc);
System.out.println(arrShirts.get(i).desc);
PrintWriter output;
try {
output = new PrintWriter("html/output/" + arrShirts.get(i).URL);
output.println(doc.outerHtml());
//System.out.println(doc.outerHtml());
output.close();
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Shirt " + i + " HTML generated!");
}
}
Thanks in advance!

Expanding a little on my comment (since Stephan encouraged me..), you can use
doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
to tell Jsoup to escape/encode special characters in the output, e.g. left double quotes (“) as &ldquo;. To make Jsoup encode all special characters, you may also need to add
doc.outputSettings().charset("ASCII");
so that all non-ASCII Unicode characters are HTML-encoded.
For larger projects where you have to fill data into HTML files, consider a template engine such as Thymeleaf: it takes less code for this kind of job and offers many more features built specifically for it. For a small project like yours, Jsoup works well (I've used it this way in the past), but as a project grows you'll want to look into more specialized tools.

Related

Can't use latin characters correctly when creating new files from within Java. Filenames get weird characters instead of the correct ones

Currently saving an int[] from hashmap in a file with the name of the key to the int[]. This exact key must be reachable from another program. Hence I can't switch name of the files to english only chars. But even though I use ISO_8859_1 as the charset for the filenames the files get all messed up in the file tree. The english letters are correct but not the special ones.
/**
* Save array to file
*/
public void saveStatus(){
try {
for(String currentKey : hmap.keySet()) {
byte[] currentKeyByteArray = currentKey.getBytes();
String bytesString = new String(currentKeyByteArray, StandardCharsets.ISO_8859_1);
String fileLocation = "/var/tmp/" + bytesString + ".dat";
FileOutputStream saveFile = new FileOutputStream(fileLocation);
ObjectOutputStream out = new ObjectOutputStream(saveFile);
out.writeObject(hmap.get(currentKey));
out.close();
saveFile.close();
System.out.println("Saved file at " + fileLocation);
}
} catch (IOException e) {
e.printStackTrace();
}
}
Could it have to do with how Linux encodes characters, or is it more likely to do with the Java code?
EDIT
I think the problem lies with the OS, because the problem is the same when looking at text files with cat, for example. However, vim is able to decode the letters correctly. In that case, would I perhaps have to change the language settings of the terminal?

You have to change the charset in the getBytes function as well.
currentKey.getBytes(StandardCharsets.ISO_8859_1);
Also, why are you using StandardCharsets.ISO_8859_1? To accept a wider range of characters, use StandardCharsets.UTF_8.
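To see why the mismatch matters, here is a minimal sketch of what happens when bytes are produced with one charset and decoded with another. The class name CharsetRoundTrip, the helper reinterpret and the sample key are made up for illustration; the question's code does the same thing whenever the platform default charset is not ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {
    // Encodes text with one charset and decodes the bytes with another.
    // When the two differ, non-ASCII characters come back mangled.
    static String reinterpret(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String key = "\u00e5\u00e4\u00f6"; // hypothetical key with three non-ASCII letters

        // UTF-8 uses two bytes for each of these letters, so decoding the bytes
        // as ISO-8859-1 yields six characters instead of three.
        String mangled = reinterpret(key, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(mangled.length()); // prints 6

        // Using the same charset on both sides round-trips cleanly.
        String intact = reinterpret(key, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println(intact.equals(key)); // prints true
    }
}
```

Note that a clean round trip only gives you back the string you started with, so the conversion is not needed at all if the key is already a correct Java String.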

The valid characters of a filename or path vary depending on the file system used. While it should be possible to just use a java string as filename (as long as it does not contain characters invalid in the given file system), there might be interoperability issues and bugs.
In other words, leave out all Charset-magic as @RealSkeptic recommends and it should work. But changing the environment might result in unexpected behavior.
Depending on your requirements, you might therefore want to encode the key to make sure it only uses a reduced character set. One variant of Base64 might work (assuming your file system is case sensitive!). You might even find a library (Apache Commons?) offering a function to reduce a string to characters safe for use in a file name.
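As a rough sketch of the Base64 idea, the JDK's URL-safe Base64 variant (Java 8+) only emits letters, digits, '-' and '_', so the result contains no path separators. The class and method names below are made up for illustration, and this still assumes a case-sensitive file system, as noted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class SafeFileNames {
    // Turns an arbitrary key into a name made only of A-Z, a-z, 0-9, '-' and '_'.
    static String encodeKey(String key) {
        return Base64.getUrlEncoder()
                .withoutPadding()
                .encodeToString(key.getBytes(StandardCharsets.UTF_8));
    }

    // Recovers the original key from a name produced by encodeKey.
    static String decodeKey(String fileName) {
        return new String(Base64.getUrlDecoder().decode(fileName), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "\u00e5\u00e4\u00f6 status"; // hypothetical key with non-ASCII letters
        String fileLocation = "/var/tmp/" + encodeKey(key) + ".dat";
        System.out.println(fileLocation);
        System.out.println(decodeKey(encodeKey(key)).equals(key)); // prints true
    }
}
```

The other program can derive the same file name from the same key, or list the directory and call decodeKey on each name. The trade-off is that the file names are no longer human-readable.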

poi XWPF double space

I am using the Apache POI XWPF library to create Word documents. For the last couple of days I have been trying to find out how to double space the whole document. I have checked the Apache javadocs and searched the internet, but can't seem to find any answers.
I found the addBreak() method but it won't work because the user will input multiple paragraphs and breaking those paragraphs into single lines seemed unreasonable. If this method is used per paragraph then it won't create the double space between each line but between each paragraph.
Here is a small part of the code I currently have.
public class Paper {
public static void main(String[] args) throws IOException, XmlException {
ArrayList<String> para = new ArrayList<String>();
para.add("The first paragraph of a typical business letter is used to state the main point of the letter. Begin with a friendly opening; then quickly transition into the purpose of your letter. Use a couple of sentences to explain the purpose, but do not go in to detail until the next paragraph.");
para.add("Beginning with the second paragraph, state the supporting details to justify your purpose. These may take the form of background information, statistics or first-hand accounts. A few short paragraphs within the body of the letter should be enough to support your reasoning.");
XWPFDocument document = new XWPFDocument();
//Calls on createParagraph() method which creates a single paragraph
for(int i=0; i< para.size(); i++){
createParagraph(document, para.get(i));
}
FileOutputStream outStream = null;
try {
outStream = new FileOutputStream("ResearchPaper.docx");
} catch (FileNotFoundException e) {
e.printStackTrace();
}
try {
document.write(outStream);
outStream.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
//Creates a single paragraph with a one tab indentation
private static void createParagraph(XWPFDocument document, String para) {
XWPFParagraph paraOne = document.createParagraph();
paraOne.setFirstLineIndent(700); // Indents first line of paragraph to the equivalence of one tab
XWPFRun one = paraOne.createRun();
one.setFontSize(12);
one.setFontFamily("Times New Roman");
one.setText(para);
}
}
Just to make sure my question is clear, I am trying to find out how to double space a word document (.docx). So between each line there should be one line of space. This is the same thing as pressing ctrl+2 when editing a word document.
Thank you for any help.

It appears that there isn't a high level method available for what you're trying to achieve. In that case, you'll need to delve into the low-level API of Apache POI. Below is a way of doing that. I'm not saying this is the best way to go about it, I've only found that it works for me when I want to recreate some bizarre feature of MS Word.
1. Find out where the information is stored in the document
If you need to tweak something manually, create 2 documents with as little content as possible: one that contains what you want to do, and one that doesn't. Save both as Office XML documents, because that makes it easy to read. Diff those files - there should only be a handful of changes, and you should have your location in the document structure.
For your case, this is the thing you're looking for.
Unmodified document:
<w:body><w:p> <!-- only included here so you know where to look -->
<w:pPr>
<w:jc w:val="both" />
<w:rPr>
<w:lang w:val="nl-BE" />
</w:rPr>
</w:pPr>
Modified document:
<w:body><w:p>
<w:pPr>
<w:spacing w:line="480" w:lineRule="auto" /> <!-- BINGO -->
<w:jc w:val="both" />
<w:rPr>
<w:lang w:val="nl-BE" />
</w:rPr>
</w:pPr>
So now you know that you need an object called spacing, that it has some properties, and that it's stored somewhere in the Paragraph object.
2. Get to that location through the low-level API, if at all possible
This part is tricky, because the XML node names are somewhat cryptic and maybe you don't know the terminology very well. Also the API isn't always a 1:1 mapping of the node names, so you have to take some guesses and just try to step through the method calls. Pro tip: download the source code for Apache POI !! You WILL step into some dead ends and you might not get where you want to be by the shortest path, but when you do you feel like an arcane Master of POI. And then you write gloating posts about it on Q&A sites.
For your case in MS Word, this is a path you might take (not necessarily the best, I'm not an expert on the high-level API):
// you probably don't need this first line
// but you'd need it if you were manipulating an existing document
IBody body = doc.getBodyElements().get(0).getBody();
for (XWPFParagraph par : body.getParagraphs()) {
// work the crazy abbreviated API magic
CTSpacing spacing = par.getCTP().getPPr().getSpacing();
if (spacing == null) {
// it looks hellish to create a CTSpacing object yourself
// so let POI do it by setting any Spacing parameter
par.setSpacingLineRule(LineSpacingRule.AUTO);
// now the Paragraph's spacing shouldn't be null anymore
spacing = par.getCTP().getPPr().getSpacing();
}
// set the line spacing, matching w:line="480" in the XML above
// (setAfter would set the space after the paragraph instead)
spacing.setLine(BigInteger.valueOf(480));
// not sure if this one is necessary
spacing.setLineRule(STLineSpacingRule.Enum.forString("auto"));
}
3. Bask in your glory !

how to get image using html parsing with jsoup

I want to get all images by parsing HTML with jsoup.
I use the code below:
Elements images = doc.select("img[src~=(?i)\\.(jpe?g)]");
for (Element image : images) {
//System.out.println("\nsrc : " + image.attr("src"));
arrImageItem.add(image.attr("src"));
}
This picks up all the images, but I want to parse URLs like this one:
http://tvrehberi.hurriyet.com.tr/images/742/403742.jpg
I want to match on the beginning of the URL:
http://tvrehberi.hurriyet.com.tr/images .... .jpg
How can I parse it like this?

This will probably give you what you ask for, though your question is a bit unclear, so I can't be sure.
public static void main(String args[]){
Document doc = null;
String url = "http://tvrehberi.hurriyet.com.tr";
try {
doc = Jsoup.connect(url).get();
} catch (IOException e1) {
e1.printStackTrace();
}
for (Element e : doc.select("img[src~=(?i)\\.(jpe?g)]")) {
if(e.attr("src").startsWith("http://tvrehberi.hurriyet.com.tr/images")){
System.out.println(e.attr("src"));
}
}
}
So, this might not be a very "clean" solution, but the if-statement makes sure it only prints the image URLs from the /images/ directory on the server.

If I understood correctly, you want to retrieve the URL path up to a certain point and cut off the rest. Do you even have to do that every time?
If you are only using URLs from the one site in your example, you could store "http://tvrehberi.hurriyet.com.tr/images" as a constant since it never changes. If, on the other hand, you fetch URLs from many different sites, you can parse your URL as described here.
Anyway, if you shared the purpose of parsing the URLs, we certainly could help you more.
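If the goal is to cut a URL down to its beginning, a minimal sketch with java.net.URI could look like this. The class and method names are made up for illustration, and it assumes an absolute http(s) URL without a port or user info.

```java
import java.net.URI;

final class ImageUrlPrefix {
    // Returns scheme://host plus the first path segment,
    // e.g. "http://host/images" for "http://host/images/742/403742.jpg".
    static String firstSegmentPrefix(String url) {
        URI uri = URI.create(url);
        // For the example URL the path is "/images/742/403742.jpg",
        // which splits into "", "images", "742", "403742.jpg".
        String[] segments = uri.getPath().split("/");
        String first = segments.length > 1 ? "/" + segments[1] : "";
        return uri.getScheme() + "://" + uri.getHost() + first;
    }

    public static void main(String[] args) {
        String src = "http://tvrehberi.hurriyet.com.tr/images/742/403742.jpg";
        System.out.println(firstSegmentPrefix(src));
        // prints http://tvrehberi.hurriyet.com.tr/images
    }
}
```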

Saxparser exception in Android

I am parsing data from this XML URL. The text input varies, depending on the user. Whenever there are spaces in the text variable I get this exception:
org.apache.harmony.xml.ExpatParser$ParseException: At line 11, column 2: mismatched tag
org.apache.harmony.xml.ExpatParser.parseFragment(ExpatParser.java:520)
org.apache.harmony.xml.ExpatParser.parseDocument(ExpatParser.java:479)
org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:318)
org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:275)
gps.app.tkartor.XMLObjects.XMLParser.findStreet(XMLParser.java:99)
If there are no spaces it works fine. Parsing code:
public void findStreet(String searchWord) {
try {
url = new URL(
"http://maps.travelsouthyorkshire.com/iGNMSearchService.asmx/TextSearch?text="
+ searchWord + "&maxResults=100");
System.out.println(url.toString());
parserFactory = SAXParserFactory.newInstance();
parser = parserFactory.newSAXParser();
reader = parser.getXMLReader();
streetHandler = new StreetHandler();
reader.setContentHandler(streetHandler);
reader.parse(new InputSource(url.openStream()));//line 99
poi = streetHandler.getAllStreets();
} catch (Exception e) {
e.printStackTrace();
}
}

Indicate spaces with '+'
http://maps.travelsouthyorkshire.com/iGNMSearchService.asmx/TextSearch?text=West+bar&maxResults=100
So in your case you have to replace the spaces of your String "searchWord". Something like
searchWord = searchWord.replace(" ", "+");

When I view both of those URLs using my web browser, I get valid XML. Certainly, there is no error at line 11 as indicated by your stacktrace.
My conclusion is that fetching the URLs (particularly the one with the space in it) is giving a different result when you do it programmatically versus doing it in a web browser. I expect that is because the browser is "helpfully" fixing the URL to encode the space character before sending it. (And I suspect that you pasted that fixed URL into the question ...)
To confirm this diagnosis, you need to capture and view the actual file contents that your Android app gets from the server. My guess is that it is actually an HTML error page. That typically won't be valid XML, and hence the XML parse error.
If this turns out to be the problem, then you need to correctly encode the search string before embedding it in the URL. If you were using plain Java, then I'd suggest using URLEncoder.encode, or assembling the URL from its components using the URI class. There may be a better way on the Android platform ...
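A minimal sketch of the URLEncoder.encode approach in plain Java, using the endpoint from the question. The class and method names are made up for illustration, and UTF-8 is an assumption about what the server expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SearchUrlBuilder {
    private static final String BASE =
            "http://maps.travelsouthyorkshire.com/iGNMSearchService.asmx/TextSearch?text=";

    // Builds the search URL with the user's input encoded as a query parameter value.
    static String buildSearchUrl(String searchWord) {
        try {
            // Spaces become '+', and characters such as '&' or '#' are percent-encoded.
            return BASE + URLEncoder.encode(searchWord, "UTF-8") + "&maxResults=100";
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("West bar"));
        // prints http://maps.travelsouthyorkshire.com/iGNMSearchService.asmx/TextSearch?text=West+bar&maxResults=100
    }
}
```

Compared with replacing only the spaces, this also copes with other characters a user might type, such as '&' or non-ASCII letters.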

Escaping large number of characters for display on XHTML web page via Java

I have an embedded device which runs Java applications which can among other things serve up XHTML web pages (I could write the pages as something other than XHTML, but I'm aiming for that for now).
When a request for a web page handled by my application is received a method is called in my code with all the information on the request including an output stream to display the page.
On one of my pages I would like to display a (log) file, which can be up to 1 MB in size.
I can display this file unescaped using the following code:
final PrintWriter writer; // Is initialized to a PrintWriter writing to the output stream.
final FileInputStream fis = new FileInputStream(file);
final InputStreamReader inputStreamReader = new InputStreamReader(fis);
try {
writer.println("<div id=\"log\" style=\"white-space: pre-wrap; word-wrap: break-word\">");
writer.println(" <pre>");
int length;
char[] buffer = new char[1024];
while ((length = inputStreamReader.read(buffer)) != -1) {
writer.write(buffer, 0, length);
}
writer.println(" </pre>");
writer.println("</div>");
} finally {
if (inputStreamReader != null) {
inputStreamReader.close();
}
}
This works reasonably well, and displays the entire file within a second or two (an acceptable timeframe).
This file can (and in practice, does) contain characters which are invalid XHTML, most commonly <>. So I need to find a way to escape these characters.
The first thing I tried was a CDATA section, but as documented here they do not display correctly in IE8.
The second thing I tried was a method like the following:
// Based on code: https://stackoverflow.com/questions/439298/best-way-to-encode-text-data-for-xml-in-java/440296#440296
// Modified to write directly to the stream to avoid creating extra objects.
private static void writeXmlEscaped(PrintWriter writer, char[] buffer, int offset, int length) {
for (int i = offset; i < offset + length; i++) { // end index is offset + length, not length
char ch = buffer[i];
boolean controlCharacter = ch < 32;
boolean unicodeButNotAscii = ch > 126;
boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';
if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
writer.write("&#" + (int) ch + ";");
} else {
writer.write(ch);
}
}
}
This correctly escapes the characters (I was going to expand it to escape HTML invalid characters if needed), but the web page then takes 15+ seconds to display and other resources on the page (images, css stylesheet) intermittently fail to load (I believe due to the requests for them timing out because the processor is pegged).
I've tried using a BufferedWriter in front of the PrintWriter as well as changing the buffer size (both for reading the file and for the BufferedWriter) in various ways, with no improvement.
Is there a way to escape all XHTML invalid characters that does not require iterating over every single character in the stream? Failing that is there a way to speed up my code enough to display these files within a couple seconds?
I'll consider reducing the size of the log files if I have to, but I was hoping to make them at least 250-500 KB in size (with 1 MB being ideal).
I already have a method to simply download the log files, but I would like to display them in browser as well for simple troubleshooting/perusal.
If there's a way to set the headers so that IE8/Firefox will simply display the file in browser as a text file I would consider that as an alternative (and have an entire page dedicated to the file with no XHTML of any kind).
EDIT:
After making the change suggested by Cameron Skinner and performance testing it looks like the escaped writing takes about 1.5-2x as long as the block-written version. It's not nothing, but I'm probably not going to be able to get a huge speedup by messing with it.
I may just need to reduce the max size of the log file.

One small change that might noticeably increase the speed is to change
writer.write("&#" + (int) ch + ";");
to
writer.write("&#");
writer.print((int) ch); // print, not write: write(int) would emit a single character rather than the number
writer.write(";");
String concatenation is comparatively expensive here, because Java builds a temporary buffer and a new String for the concatenation expression, so you generate throwaway objects each time there is a character that needs replacing.
EDIT: One of the comments on another answer is highly relevant: find where the slow bit is first. I'd suggest testing logs that have no characters to be escaped and many characters to be escaped.
I think you should make the suggested change anyway because it costs you only a few seconds of your time.
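For reference, here is a self-contained sketch of the escaping loop with that change applied. The class name and the escape helper are made up for illustration; the escaping rules are the ones from the question's method. Note that it uses print for the number, since PrintWriter.write(int) writes a single character.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class XmlEscapeSketch {
    // Writes buffer[offset .. offset + length) to the writer, replacing control
    // characters, non-ASCII characters and <, &, > with numeric character references.
    // Caveat: control characters other than tab, LF and CR are not valid in XML 1.0
    // even as references, so real log data may need those dropped instead.
    static void writeXmlEscaped(PrintWriter writer, char[] buffer, int offset, int length) {
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            char ch = buffer[i];
            boolean needsEscape = ch < 32 || ch > 126 || ch == '<' || ch == '&' || ch == '>';
            if (needsEscape) {
                writer.write("&#");
                writer.print((int) ch); // decimal code point, no temporary String concatenation
                writer.write(';');
            } else {
                writer.write(ch);
            }
        }
    }

    // Convenience wrapper for trying the method on a String.
    static String escape(String text) {
        StringWriter out = new StringWriter();
        PrintWriter writer = new PrintWriter(out);
        char[] chars = text.toCharArray();
        writeXmlEscaped(writer, chars, 0, chars.length);
        writer.flush();
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b&c>")); // prints a&#60;b&#38;c&#62;
    }
}
```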

You can try StringEscapeUtils from commons-lang:
StringEscapeUtils.escapeHtml(writer, string);

One option is for you to serve up the log contents inside of an iframe hosted inside of your web page. The iframe's source could point to a URL that serves up the content as text.
