URL Encoding Issue with "Œ" - java

I have read all of the Java URL encoding threads on here but still haven't found a solution to my problem: Google Chrome encodes "BŒUF" to "B%8CUF" POST data, awesome. How can I convince Java to do the same? (The website is <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr"> and <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> in case this is important.)
System.out.println(URLEncoder.encode("BŒUF", "utf-8"));
System.out.println(URLEncoder.encode("BŒUF", "iso-8859-1"));
System.out.println(URLEncoder.encode("BŒUF", "iso-8859-15"));
System.out.println(new URI("http","www.google.com","/ig/api","BŒUF", null).toASCIIString());
prints
B%C5%92UF
B%3FUF
B%BCUF
http://www.google.com/ig/api?B%C5%92UF
but not "B%8CUF"?

You are specifically looking for the windows-1252 encoding, not UTF-8:
System.out.println(URLEncoder.encode("BŒUF", "windows-1252"));
Gives,
B%8CUF
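The difference comes from where each charset places "Œ" (U+0152). ISO-8859-1 has no mapping for it, so Java substitutes "?" (%3F). ISO-8859-15 puts it at 0xBC and windows-1252 at 0x8C. Browsers treat a page declared as iso-8859-1 as windows-1252, which is why Chrome sends %8C. A minimal sketch comparing the four charsets side by side (the class name and the encode helper are made up for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class OeEncodingDemo {
    // Hypothetical helper: wraps URLEncoder.encode so callers need no checked exception.
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // \u0152 is "Œ"; the escape avoids any dependence on the source file encoding.
        String text = "B\u0152UF";
        String[] charsets = {"UTF-8", "ISO-8859-1", "ISO-8859-15", "windows-1252"};
        for (String charset : charsets) {
            System.out.println(charset + " -> " + encode(text, charset));
        }
    }
}
```

Only the windows-1252 line matches what Chrome posts for this page.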

Related

How to allow special characters like the dash in Java Application?

This issue occurs when a user copies and pastes a value from a Word document: the system then shows "?" instead of the dash "–". The copied string contains a dash rather than a hyphen. A dash and the ASCII hyphen are different characters, and the dash is not part of ASCII. (The "–" shown here is an en dash, U+2013; the em dash is U+2014.) My application does not support the dash character.
Example 1 (not working): input string "Anil – Satija", displayed value "Anil ? Satija".
Example 2 (working): input string "Anil - Satija"; the hyphen is displayed correctly.
The client side uses Angular 1.5.5 and the server side uses Spring 5.2.2. The request is sent as a POST. The application uses the UTF-8 charset and the content type is "application/xml". I also tried adding the code below to the API so that it accepts the full UTF-8 character set.
index.html Meta Data :
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
API Code is:
@RequestMapping(value = "/data", method = RequestMethod.POST)
@Consumes("application/xml; charset=utf-8")
public @ResponseBody
String getData(HttpServletRequest request, HttpServletResponse response) throws Throwable {
    RequestPayLoad requestPayLoad = parseRequest(request);
    // requestPayLoad has corrupted value.
    // code
}
While debugging, $scope.data at the client layer (xFactory.js) holds the correct value with the dash, but at the API layer the request (HttpRequest) already contains the corrupted character. The system therefore displays "?" in place of the dash.
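For reference, "?" is exactly what Java produces when a string passes through a charset that has no mapping for a character. The sketch below only illustrates that mechanism and is not a diagnosis of the application above; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DashRoundTrip {
    // Encodes the text to bytes in the given charset and decodes it again.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // \u2013 is the en dash from the example input.
        String input = "Anil \u2013 Satija";
        System.out.println(roundTrip(input, StandardCharsets.ISO_8859_1)); // dash is lost
        System.out.println(roundTrip(input, StandardCharsets.UTF_8).equals(input)); // dash survives
    }
}
```

If any step between the browser and parseRequest encodes or decodes with ISO-8859-1 or a platform default instead of UTF-8, the dash would be lost in this way.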

PDFTextStripper parsing with wrong encoding

PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();
result contains something like
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always;
page-break-after:always"><div><p>&#...
instead of
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always; page-break-after:always"><div><p>any
blablabla characters...
Changing the encoding to windows-1252 or utf-8 does not change the result. The problematic PDF is at http://www.permaco.ch/fileadmin/user_upload/jobs/Inserat_SeLe_EE_rev.pdf
How to parse this pdf?
"How to parse this pdf?" Short of OCR'ing it, you don't.
The PDF in question does not contain the information required to extract text without doing at least some OCR (at least OCR'ing each character of the used font to find a mapping from glyph to character) which would require additional libraries and code.
As a requirement for text extraction the PDF specification ISO 32000-1:2008 correctly states in section 9.10.2 that the font used for the text to extract needs to
either contain a ToUnicode CMap — the font used in your document doesn't —
or be a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection — the font used in your document isn't —
or be a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font — the font used in your document neither uses one of those predefined encodings nor are the character names in its Differences array from those selections mentioned: the names used are /0, /1, ..., /155.
Generally a good first test is to try and copy&paste text using Adobe Reader as much text extraction experience is in the Reader's code. When trying to do so, you'll see that you only get garbage.

Encode-DecodeAdventure

I tried to encode special characters in JavaScript using the encodeURI() and encodeURIComponent() functions and to decode them using the java.net.URLDecoder.decode() method. This worked like a charm in Firefox, but it doesn't seem to work in Internet Explorer. Is there any alternative where the same code would work in both browsers?
Example:
When I pass $%^& as the value, it becomes %24%25%5E%26 after encoding. After decoding with the java.net.URLDecoder.decode() method, it becomes $%%5E&
this is the actual value-
var str = "$%^&";
var valueJS = encodeURI(str);
valueJS = encodeURIComponent(valueJS); // to encode even those chars in valueJS that were not encoded by encodeURI()
this is the encoded value-
String value = "%2524%2525%255E%2526";
while(value.matches(".*%25[A-Za-z0-9]*")) {
value = value.replace("%25", "%"); // manually trying to achieve %24%25%5E%26
}
value = java.net.URLDecoder.decode(value, "UTF-8");
// I was expecting the decoded value to be $%^&, but it turns out to be $%%5E&
Fixed it. Added 2 <meta> tags
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
and used JavaScript's escape() instead of encodeURI() and encodeURIComponent()
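An alternative to stripping "%25" by hand is to decode once per encoding pass, since each pass turns every "%" into "%25". The sketch below is a Java-only illustration under stated assumptions: URLEncoder stands in for the JavaScript side (it agrees with encodeURIComponent for "$%^&" but differs for some other characters, such as spaces), and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class DoubleEncodingDemo {
    // Applies URL encoding the given number of times.
    static String encodeTimes(String value, int passes) {
        try {
            for (int i = 0; i < passes; i++) {
                value = URLEncoder.encode(value, "UTF-8");
            }
            return value;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Applies URL decoding the given number of times.
    static String decodeTimes(String value, int passes) {
        try {
            for (int i = 0; i < passes; i++) {
                value = URLDecoder.decode(value, "UTF-8");
            }
            return value;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String original = "$%^&";
        String once = encodeTimes(original, 1);   // single pass
        String twice = encodeTimes(original, 2);  // the double-encoded form from the question
        System.out.println(once);
        System.out.println(twice);
        System.out.println(decodeTimes(twice, 2)); // two decode passes undo two encode passes
    }
}
```

Encoding only once on the client (encodeURIComponent alone) and decoding once on the server avoids the problem altogether.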

StreamCorruptedException on only two out of three boxes?

I'm completely at a loss as to why this is occurring. I have a javascript file deployed to three test boxes. On two of them, I get the following error when I hit the page (below). As you can see, it says the stream header is a string of zeroes, and there are no zeroes in the file it's reading. I assume this means that it's trying to read the data but nothing's coming through (hence all the unset bits...) but the fact that it works on one box but not the other two is baffling, since they all have the same code deployed to them.
I've looked through the Java service for any streams that aren't closed or read properly, but the service doesn't even open them. As you can see in the stack trace, it utilizes Ehcache which performs those operations (and are implemented correctly).
Any pointers in the right direction on what in the world this is doing?
exception
org.apache.jasper.JasperException: An exception occurred processing JSP page /CacheHistory.jsp at line 6
3: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
4:
5: <%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %>
6: <jsp:useBean id="cacheHistory" class="org.jpg.CacheHistory" />
7: <html>
8: <head>
9: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
Stacktrace:
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:519)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:428)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
root cause
net.sf.ehcache.CacheException: java.io.StreamCorruptedException: invalid stream header: 00000000
net.sf.ehcache.store.disk.DiskStorageFactory.retrieve(DiskStorageFactory.java:964)
net.sf.ehcache.store.disk.Segment.decodeHit(Segment.java:178)
net.sf.ehcache.store.disk.Segment.get(Segment.java:216)
net.sf.ehcache.store.disk.DiskStore.get(DiskStore.java:504)
net.sf.ehcache.store.disk.DiskStore.getQuiet(DiskStore.java:511)
net.sf.ehcache.store.FrontEndCacheTier.getQuiet(FrontEndCacheTier.java:196)
net.sf.ehcache.Cache.searchInStoreWithoutStats(Cache.java:2101)
net.sf.ehcache.Cache.get(Cache.java:1630)
net.sf.ehcache.Cache.get(Cache.java:1597)
root cause
java.io.StreamCorruptedException: invalid stream header: 00000000
java.io.ObjectInputStream.readStreamHeader(Unknown Source)
java.io.ObjectInputStream.<init>(Unknown Source)
net.sf.ehcache.util.PreferTCCLObjectInputStream.<init>(PreferTCCLObjectInputStream.java:39)
net.sf.ehcache.store.disk.DiskStorageFactory.read(DiskStorageFactory.java:375)
net.sf.ehcache.store.disk.DiskStorageFactory.retrieve(DiskStorageFactory.java:960)
net.sf.ehcache.store.disk.Segment.decodeHit(Segment.java:178)
net.sf.ehcache.store.disk.Segment.get(Segment.java:216)
net.sf.ehcache.store.disk.DiskStore.get(DiskStore.java:504)
net.sf.ehcache.store.disk.DiskStore.getQuiet(DiskStore.java:511)
net.sf.ehcache.store.FrontEndCacheTier.getQuiet(FrontEndCacheTier.java:196)
net.sf.ehcache.Cache.searchInStoreWithoutStats(Cache.java:2101)
net.sf.ehcache.Cache.get(Cache.java:1630)
net.sf.ehcache.Cache.get(Cache.java:1597)
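The file being read is corrupt on the boxes that fail. Per the stack trace, that file is Ehcache's on-disk store (read in DiskStorageFactory.read), not the JSP itself.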
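The "invalid stream header: 00000000" message can be reproduced with nothing but zero bytes, which fits a zero-filled or truncated cache file. A minimal sketch (the class and method names are hypothetical, and an in-memory byte array stands in for the file on disk):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.StreamCorruptedException;

public class StreamHeaderDemo {
    // Opens an ObjectInputStream over the bytes, which reads and validates the
    // 4-byte serialization header. Returns "ok" or the corruption message.
    static String probe(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return "ok";
        } catch (StreamCorruptedException e) {
            return e.getMessage();
        } catch (IOException e) {
            return "io error: " + e;
        }
    }

    // Serializes an object to a byte array, producing a valid stream header.
    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probe(new byte[16]));        // zero-filled data
        System.out.println(probe(serialize("hello")));  // properly serialized data
    }
}
```

On the failing boxes, deleting the cache's disk store files so they are rebuilt may be worth trying, though that is a suggestion rather than something the original answer states.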

How to solve Flex utf-8 encoding

I am developing a Facebook application using Flex's XMLSocket and Java.
When I type the 'ş' character on the client side, it displays correctly; however, when I send the 'ş' character,
it is printed as ??? or other unpredictable characters.
I tried to change my html file's meta tag to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
but it did not work.
Overall, how can I get rid of this problem?
Thanks.
Use encodeURIComponent(yourstring), this might do the trick.
