Encode-DecodeAdventure - java

I encoded special characters in JavaScript using encodeURI() and encodeURIComponent() and decoded them in Java using java.net.URLDecoder.decode(). This worked like a charm in Firefox, but it doesn't seem to work in Internet Explorer. Is there alternative code that would work in both browsers?
Example:
When I pass $%^& as the value, it becomes %24%25%5E%26 after encoding. After decoding with java.net.URLDecoder.decode(), it becomes $%%5E&
This is the actual value:
var str = "$%^&";
var valueJS = encodeURI(str);
valueJS = encodeURIComponent(valueJS); // to encode even those chars in valueJS that were not encoded by encodeURI()
This is the encoded value:
String value = "%2524%2525%255E%2526";
while (value.matches(".*%25[A-Za-z0-9]*")) {
    value = value.replace("%25", "%"); // manually trying to achieve %24%25%5E%26
}
value = java.net.URLDecoder.decode(value, "UTF-8");
// I was expecting the decoded value to be $%^&, but it turns out to be $%%5E&
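For what it's worth, the value in the JavaScript above is encoded twice, so one option is to decode it twice instead of stripping %25 by hand. A minimal sketch, assuming Java 10+ (for the Charset overload of URLDecoder.decode) and assuming the input is what encodeURI() followed by encodeURIComponent() produces for $%^&, namely %24%2525%255E%26. The class and method names are made up for illustration:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class DoubleDecodeDemo {
    // Hypothetical helper: percent-decodes the value the given number of times.
    static String decodeTimes(String value, int times) {
        String result = value;
        for (int i = 0; i < times; i++) {
            result = URLDecoder.decode(result, StandardCharsets.UTF_8);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed output of encodeURIComponent(encodeURI("$%^&")).
        String doubleEncoded = "%24%2525%255E%26";
        System.out.println(decodeTimes(doubleEncoded, 1)); // $%25%5E&
        System.out.println(decodeTimes(doubleEncoded, 2)); // $%^&
    }
}
```

Encoding only once on the client (encodeURIComponent() alone) and decoding once on the server would avoid the problem altogether.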

Fixed it. I added two <meta> tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
and used JavaScript's escape() instead of encodeURI() and encodeURIComponent().

Related

How to allow special characters like the dash in Java Application?

The issue occurs when a user copies and pastes a value from a Word document: the system shows "?" instead of the dash "–" character. The copied string contains a dash rather than a hyphen. Dashes and the hyphen are different characters, and dashes are not part of ASCII (the en dash is \u2013 and the em dash is \u2014). My application does not support the dash character.
Example 1 (not working): input string "Anil – Satija", displayed value "Anil ? Satija".
Example 2 (working): Name = "Anil - Satija"; the hyphen works fine and shows the correct value.
The client side is Angular 1.5.5 and the server side is Spring 5.2.2. The request is sent as a POST. The application supports charset "UTF-8" and the content type is "application/xml". I also tried adding the code below to the API so that it accepts the full UTF-8 character set.
index.html meta data:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
The API code is:
@RequestMapping(value = "/data", method = RequestMethod.POST)
@Consumes("application/xml; charset=utf-8")
public @ResponseBody String getData(HttpServletRequest request, HttpServletResponse response) throws Throwable {
    RequestPayLoad requestPayLoad = parseRequest(request);
    // requestPayLoad has the corrupted value.
    // code
}
In the debugger, at the client layer (xFactory.js), $scope.data has the correct value with the dash, but at the API layer (HttpRequest) the request already has the corrupted character. So the system displays '?' for the corrupted character in this case.
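A "?" in place of a dash is the typical symptom of the text passing through a charset that cannot represent that character somewhere between the browser and the API. The sketch below only reproduces the symptom in plain Java; it is an illustration, not a diagnosis of the code above, and the class and method names are made up:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DashRoundTripDemo {
    // Hypothetical helper: encodes the text to bytes and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String input = "Anil \u2013 Satija"; // \u2013 is the en dash from the example

        // UTF-8 can represent the dash, so the text survives unchanged.
        System.out.println(roundTrip(input, StandardCharsets.UTF_8).equals(input)); // true

        // ISO-8859-1 has no dash; getBytes() substitutes '?' for it.
        System.out.println(roundTrip(input, StandardCharsets.ISO_8859_1)); // Anil ? Satija
    }
}
```

If the request body is read through a Reader or converted to a String without an explicit UTF-8 charset, that is one place where such a substitution could happen.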

URL Encoding Issue with "Œ"

I have read all of the Java URL encoding threads on here but still haven't found a solution to my problem: Google Chrome encodes "BŒUF" as "B%8CUF" in POST data, awesome. How can I convince Java to do the same? (The website uses <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr"> and <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, in case this is important.)
System.out.println(URLEncoder.encode("BŒUF", "utf-8"));
System.out.println(URLEncoder.encode("BŒUF", "iso-8859-1"));
System.out.println(URLEncoder.encode("BŒUF", "iso-8859-15"));
System.out.println(new URI("http","www.google.com","/ig/api","BŒUF", null).toASCIIString());
prints
B%C5%92UF
B%3FUF
B%BCUF
http://www.google.com/ig/api?B%C5%92UF
but not "B%8CUF"?
You are specifically looking for the windows-1252 encoding, not UTF-8:
System.out.println(URLEncoder.encode("BŒUF", "windows-1252"));
gives
B%8CUF
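The likely reason Chrome behaves this way is that browsers treat the label iso-8859-1 as windows-1252 (the WHATWG Encoding Standard maps it that way). A self-contained comparison of the four charsets, as a sketch assuming Java 10+ (for the Charset overload of URLEncoder.encode); the class and method names are made up:

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;

class OeEncodingDemo {
    // Hypothetical helper: percent-encodes the text using the named charset.
    static String encode(String text, String charsetName) {
        return URLEncoder.encode(text, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "B\u0152UF"; // \u0152 is the letter Œ
        System.out.println(encode(text, "UTF-8"));        // B%C5%92UF
        System.out.println(encode(text, "ISO-8859-1"));   // B%3FUF (Œ is not in Latin-1, so it becomes '?')
        System.out.println(encode(text, "ISO-8859-15"));  // B%BCUF
        System.out.println(encode(text, "windows-1252")); // B%8CUF
    }
}
```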

PDFTextStripper parsing with wrong encoding

PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();
result contains something like
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always;
page-break-after:always"><div><p>&#...
instead of
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always; page-break-after:always"><div><p>any
blablabla characters...
When I changed the encoding to windows-1252 or utf-8, the result did not change. The problematic PDF is at http://www.permaco.ch/fileadmin/user_upload/jobs/Inserat_SeLe_EE_rev.pdf
How to parse this pdf?
Short of OCR'ing it you don't.
The PDF in question does not contain the information required to extract text without doing at least some OCR (at least OCR'ing each character of the used font to find a mapping from glyph to character) which would require additional libraries and code.
As a requirement for text extraction, the PDF specification ISO 32000-1:2008 states in section 9.10.2 that the font used for the text to extract needs to
either contain a ToUnicode CMap (the font used in your document doesn't),
or be a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (the font used in your document isn't one),
or be a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (the font used in your document does neither: it uses none of those predefined encodings, and the character names in its Differences array, /0, /1, ..., /155, are not from those sets).
Generally, a good first test is to try to copy and paste text using Adobe Reader, since a lot of text extraction experience has gone into the Reader's code. If you try that with this document, you'll see that you only get garbage.

How to solve Flex utf-8 encoding

I am developing a Facebook application using Flex's XMLSocket and Java.
When I type the 'ş' character on the client side, it prints correctly; however, when I send the 'ş' character,
it is printed as ??? or other unpredictable characters.
I tried changing my HTML file's meta tag to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
but that did not work.
How can I get rid of this problem?
Thanks.
Use encodeURIComponent(yourstring); this might do the trick.
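If the client does send the text through encodeURIComponent(), the Java side then has to percent-decode it as UTF-8. A minimal sketch of that step, assuming Java 10+ and assuming the server receives the encoded text as a plain ASCII string; how the string arrives over the socket is not shown, and the class and method names are made up:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class PercentDecodeDemo {
    // Hypothetical helper: reverses encodeURIComponent() on the server side.
    static String decodeUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // encodeURIComponent("ş") yields %C5%9F, the UTF-8 bytes of U+015F.
        String received = "%C5%9F";
        String decoded = decodeUtf8(received);
        System.out.println(decoded.equals("\u015F")); // true
    }
}
```

Because the encoded form is pure ASCII, it passes through unchanged whatever charset the socket reader happens to use. Note that URLDecoder also turns a literal + into a space, which encodeURIComponent() never produces for a space (it emits %20), so this pairing is safe for text encoded that way.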

Special Characters In Webapp being saved differently

I'm creating a webapp using Spring MVC, and some of the information I'm pulling comes from a database, so it was edited elsewhere. Some of the imported values contain what I consider special characters, such as
“_blank”
as opposed to using the standard keyboard
"_blank".
When I display this in a textarea on my website, it displays fine, but when I submit the form (a Spring textarea) and save the value back into a string, the string has ? where the 'special' characters were. They were obviously imported into a String fine, but somewhere in the save process they are not being preserved. Any idea what is causing this, or why?
This sounds like a character encoding problem. Try setting the character set of the page containing the form to UTF-8.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
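To see which charsets can carry the curly quotes at all, a quick check with CharsetEncoder.canEncode() may help. This is only a diagnostic sketch; which charset the page, the form handling, or the database actually uses is not known from the question, and the class and method names are made up:

```java
import java.nio.charset.Charset;

class CanEncodeDemo {
    // Hypothetical helper: reports whether every character of the text exists in the named charset.
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "\u201C_blank\u201D"; // curly double quotes around _blank

        System.out.println(canEncode(text, "ISO-8859-1"));   // false: Latin-1 has no curly quotes
        System.out.println(canEncode(text, "windows-1252")); // true
        System.out.println(canEncode(text, "UTF-8"));        // true
    }
}
```

A false result for the charset in use at some step of the save path would be consistent with the ? characters described above.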
