How to convert non-supported character to html entity in Java

Some characters are not supported by certain charsets, so the test below fails. I would like to use HTML entities to encode ONLY the unsupported characters. How can I do this in Java?
public void testWriter() throws IOException{
String c = "\u00A9";
String encoding = "gb2312";
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
Writer writer = new BufferedWriter(new OutputStreamWriter(outStream, encoding));
writer.write(c);
writer.close();
String result = new String(outStream.toByteArray(), encoding);
assertEquals(c, result);
}

I'm not positive I understand the question, but something like this might help:
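import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
...
StringBuilder buf = new StringBuilder(c.length());
// Charset.forName returns a Charset; the encoder has to be created from it
CharsetEncoder enc = Charset.forName("gb2312").newEncoder();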
for (int idx = 0; idx < c.length(); ++idx) {
char ch = c.charAt(idx);
if (enc.canEncode(ch))
buf.append(ch);
else {
buf.append("&#");
buf.append((int) ch);
buf.append(';');
}
}
String result = buf.toString();
This code is not robust, because it doesn't handle characters beyond the Basic Multilingual Plane. By iterating over the code points in the String instead, and using the canEncode(CharSequence) method of CharsetEncoder, you should be able to handle any character.
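A minimal sketch of that code-point variant, in case it helps. The class and method names (EntityFallback, escapeUnsupported) are made up for illustration; only the JDK calls are real.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class EntityFallback {
    // Hypothetical helper: keeps characters the charset can encode and
    // replaces everything else with a decimal numeric character reference.
    static String escapeUnsupported(String input, String charsetName) {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        StringBuilder buf = new StringBuilder(input.length());
        for (int idx = 0; idx < input.length(); ) {
            int codePoint = input.codePointAt(idx);
            int len = Character.charCount(codePoint); // 2 for a surrogate pair
            CharSequence unit = input.subSequence(idx, idx + len);
            if (enc.canEncode(unit)) {
                buf.append(unit);
            } else {
                buf.append("&#").append(codePoint).append(';');
            }
            idx += len;
        }
        return buf.toString();
    }
}
```

With "US-ASCII" as the charset, the copyright sign U+00A9 becomes `&#169;`, and a character outside the BMP is written as one entity rather than two broken halves.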

Try using StringEscapeUtils from apache commons.

Just use utf-8, and that way there is no reason to use entities.
If there is an argument that some clients need gb2312 because they don't understand Unicode, then entities are not much use either, because the numeric entities represent Unicode code points.

Related

outputstream writer java for integers

I am quite new to Java. I need to save XML to CSV using Java, but the problem is that I cannot use CSVWriter because the XML also contains UTF-8 encoded data.
I found out that it is possible to use an OutputStreamWriter, which can encode to UTF-8.
For strings everything is OK, but for an integer I cannot get the correct number.
Sample code:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.*;
public class UTF8WriterDemo {
public static void main(String[] args) {
Writer out = null;
try {
out = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream("c://java2//file.csv"), "windows-1250"));
//for (int i=0; i<4; i++ ) {
String text = "This tečt will be added to File !!";
int hu = 4;
out.write('\ufeff');
out.write(text+ '\n');
out.write(hu+ '\n');
//}
out.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
I get a symbol instead of the number.
I suppose it's because:
An OutputStreamWriter is a bridge from character streams to byte streams: Characters written to it are encoded into bytes using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
And that's why it's not displayed correctly.
Therefore I would like to ask, is there any option for integers to be displayed using outputstreamwriter?
Or if not, how can I convert xml data into csv using java for UTF8 encoded characters?
Thank you

Java distinguishes between double quotes and single quotes.
"foo" is a String.
'f' is a char (or Character).
'foo' will not compile, because a char literal can hold only one character.
'\n' is also a single character, specifically the newline character. Adding an int and a char performs integer arithmetic: hu + '\n' is 4 + 10 = 14, and Writer.write(int) then writes the single character with code 14 (a control character) instead of the text "4" followed by a newline.
Using double quotes (hu + "\n"), which turns the expression into a String concatenation, should fix your issue.
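To see the difference in isolation, here is a small sketch that uses a StringWriter instead of a file. The class and method names are invented for the example.

```java
import java.io.StringWriter;

final class IntPlusCharDemo {
    // n + '\n' is int arithmetic, so write(int) is called and emits ONE character
    static String writeIntPlusChar(int n) {
        StringWriter out = new StringWriter();
        out.write(n + '\n');
        return out.toString();
    }

    // n + "\n" is String concatenation, so write(String) is called
    static String writeIntPlusString(int n) {
        StringWriter out = new StringWriter();
        out.write(n + "\n");
        return out.toString();
    }
}
```

For n = 4 the first method produces the single control character U+000E, while the second produces the digit 4 followed by a newline.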

import java.io.*;
public class UTF8WriterDemo {
public static void main(String[] args) {
Writer out = null;
try {
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("file.csv"), "windows-1250"));
//for (int i = 0; i < 4; i++) {
String text = "This text will be added to File !!";
int hu = 4;
String text2 = String.valueOf(hu);
out.write('\ufeff');
out.write(text + '\n');
out.write(text2 + '\n');
// }
out.close();
} catch (Exception e) {
e.printStackTrace();
} finally {
System.out.println("The process is completed.");
}
}
}

Actually, I need to rewrite this construction:
FileWriter fileWriter = new
FileWriter("C:\\java\\test\\EEexample3.csv");
CSVWriter csvWriter = new CSVWriter(fileWriter);
csvWriter.writeNext(new String[] {
..
..
..
..
}
..code.. code..
String homeCurrencyPriceString = iit.getHomeCurrency().getPrice()!=null?iit.getHomeCurrency().getPrice().toString():"";
String headerDateString = invoiceHeaderType.getDateTax()!=null?invoiceHeaderType.getDateTax().toString():"";
String invoiceTypeString = invoiceHeaderType.getInvoiceType()!=null?invoiceHeaderType.getInvoiceType().value():"";
String headeraccountno= invoiceHeaderType.getAccount().getAccountNo()!=null?invoiceHeaderType.getAccount().getAccountNo().toString():"";
String headertext = invoiceHeaderType.getText()!=null?invoiceHeaderType.getText():"";
String invoiceitemtext= iit.getText()!=null?iit.getText():"";
String headericdph = invoiceHeaderType.getPartnerIdentity().getAddress().getIcDph()!=null?invoiceHeaderType.getPartnerIdentity().getAddress().getIcDph():"";
String symVar = invoiceHeaderType.getSymVar()!=null?invoiceHeaderType.getSymVar():"";
csvWriter.writeNext(new String[] {
invoiceHeaderType.getPartnerIdentity().getAddress().getIco(), headericdph, invoiceHeaderType.getPartnerIdentity().getAddress().getCompany(),symVar, invoiceHeaderType.getId().toString(), iit.getId().toString(), homeCurrencyPriceString, detailcentreString,headercentreString, headerDateString, invoiceTypeString,headeraccountno, headertext,invoiceitemtext
});
(where the objects are filled from the XML) into an OutputStreamWriter-based construction.
So first I am trying OutputStreamWriter in a simple example to be sure it works; once it does, I want to rewrite the whole code.
With CSVWriter everything worked smoothly, but now texts encoded in UTF-8/windows-1250 have been added :( so I need to change the construction of the code.
Even numeric objects like the price are converted using .toString(), so maybe it works without int.
I hope the OutputStreamWriter is able to do what is necessary.
I am going to try.

Formatting Web Service Response

I use the below function to retrieve the web service response:
private String getSoapResponse (String url, String host, String encoding, String soapAction, String soapRequest) throws MalformedURLException, IOException, Exception {
URL wsUrl = new URL(url);
URLConnection connection = wsUrl.openConnection();
HttpURLConnection httpConn = (HttpURLConnection)connection;
ByteArrayOutputStream bout = new ByteArrayOutputStream();
byte[] buffer = new byte[soapRequest.length()];
buffer = soapRequest.getBytes();
bout.write(buffer);
byte[] b = bout.toByteArray();
httpConn.setRequestMethod("POST");
httpConn.setRequestProperty("Host", host);
if (encoding == null || encoding == "")
encoding = UTF8;
httpConn.setRequestProperty("Content-Type", "text/xml; charset=" + encoding);
httpConn.setRequestProperty("Content-Length", String.valueOf(b.length));
httpConn.setRequestProperty("SOAPAction", soapAction);
httpConn.setDoOutput(true);
httpConn.setDoInput(true);
OutputStream out = httpConn.getOutputStream();
out.write(b);
out.close();
InputStreamReader is = new InputStreamReader(httpConn.getInputStream());
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(is);
String read = br.readLine();
while(read != null) {
sb.append(read);
read = br.readLine();
}
String response = decodeHtmlEntityCharacters(sb.toString());
return response = decodeHtmlEntityCharacters(response);
}
But my problem with this code is it returns lots of special characters and makes the structure of the XML invalid.
Example response:
&lt;PLANT&gt;A565&lt;/PLANT&gt;
&lt;PLANT&gt;A567&lt;/PLANT&gt;
&lt;PLANT&gt;A585&lt;/PLANT&gt;
&lt;PLANT&gt;A921&lt;/PLANT&gt;
&lt;PLANT&gt;A938&lt;/PLANT&gt;
&lt;/PLANT_GROUP&gt;
&lt;/KPI_PLANT_GROUP_KEYWORD&gt;
&lt;MSU_CUSTOMERS/&gt;
&lt;/DU&gt;
&lt;DU&gt;
To solve this, I use the method below and pass it the whole response, replacing every entity with its corresponding character.
private final static Hashtable htmlEntitiesTable = new Hashtable();
static {
htmlEntitiesTable.put("&amp;","&");
htmlEntitiesTable.put("&quot;","\"");
htmlEntitiesTable.put("&lt;","<");
htmlEntitiesTable.put("&gt;",">");
}
private String decodeHtmlEntityCharacters(String inputString) throws Exception {
Enumeration en = htmlEntitiesTable.keys();
while(en.hasMoreElements()){
String key = (String)en.nextElement();
String val = (String)htmlEntitiesTable.get(key);
inputString = inputString.replaceAll(key, val);
}
return inputString;
}
But another problem arose. If the response contains the segment &lt;VALUE&gt;&lt; 0.5 &lt;/VALUE&gt; and it is processed by this method, the output is:
<VALUE>< 0.5</VALUE>
Which makes the structure of the XML invalid again.
The data "< 0.5" is correct and valid, but having it inside the VALUE element breaks the structure of the XML.
Can you please help how to deal with this? Maybe the way I get or build the response can be improved. Is there any better way to call and get the response from web service?
How can I deal with elements containing "<" or ">"?

Do you know how to use a third-party open source library?
You should try using apache commons-lang:
StringEscapeUtils.unescapeXml(xml)
More detail is provided in the following stack overflow post:
how to unescape XML in java
Documentation:
http://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html
http://commons.apache.org/proper/commons-lang/userguide.html#lang3.

You're using SOAP wrong.
In particular, you do not need the following line of code:
String response = decodeHtmlEntityCharacters(sb.toString());
Just return sb.toString(). And for $DEITY's sake, do not use string methods to parse the retrieved string, use an XML parser, or a full-blown SOAP stack...
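A rough sketch of the parser approach, using only the JDK's built-in DOM parser. The helper name firstElementText is hypothetical, and the sketch assumes the element exists in the document. The point is that the parser decodes entities such as &lt; inside element text itself, so no manual replacement is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class SoapValueReader {
    // Hypothetical helper: returns the text of the first element with the
    // given tag name; assumes such an element is present.
    static String firstElementText(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent() returns the already-unescaped character data
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

If the service wraps an escaped XML document inside a SOAP string element, the text returned here would presumably be that inner document, which can then be parsed a second time in the same way.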

Does the > or < character always appear at the beginning of a value? Then you could use regex to handle the cases in which the > or < are followed by a digit (or dot, for that matter).
Sample code, assuming the replacement strings used in it don't appear anywhere else in the XML:
private String decodeHtmlEntityCharacters(String inputString) throws Exception {
Enumeration en = htmlEntitiesTable.keys();
// Replaces &gt; or &lt; followed by dot or digit (while keeping the dot/digit)
inputString = inputString.replaceAll("&gt;(\\.?\\d)", "Valuegreaterthan$1");
inputString = inputString.replaceAll("&lt;(\\.?\\d)", "Valuelesserthan$1");
while(en.hasMoreElements()){
String key = (String)en.nextElement();
String val = (String)htmlEntitiesTable.get(key);
inputString = inputString.replaceAll(key, val);
}
inputString = inputString.replaceAll("Valuelesserthan", "&lt;");
inputString = inputString.replaceAll("Valuegreaterthan", "&gt;");
return inputString;
}
Note the most appropriate answer (and easier for everyone) would be to correctly encode the XML at the sender side (it would also render my solution non-working BTW).

It would be hard to cope with every situation, but you could cover the most common ones by adding a few more rules: assume that any less-than sign followed by a space is data, and that any greater-than sign preceded by a space is data, and encode those again.
private final static Hashtable htmlEntitiesTable = new Hashtable();
static {
htmlEntitiesTable.put("&amp;","&");
htmlEntitiesTable.put("&quot;","\"");
htmlEntitiesTable.put("&lt;","<");
htmlEntitiesTable.put("&gt;",">");
}
private String decodeHtmlEntityCharacters(String inputString) throws Exception {
Enumeration en = htmlEntitiesTable.keys();
while(en.hasMoreElements()){
String key = (String)en.nextElement();
String val = (String)htmlEntitiesTable.get(key);
inputString = inputString.replaceAll(key, val);
}
inputString = inputString.replaceAll("< ","&lt; ");
inputString = inputString.replaceAll(" >"," &gt;");
return inputString;
}

'>' is not escaped in XML. So you shouldn't have an issue with that. Regarding '<', here are the options I can think of.
Use CDATA in web response for text containing special characters.
Rewrite the text by reversing the order. For example, if it is x < 2, change it to 2 > x. '>' does not need to be escaped in element content (the one exception is the sequence ]]>).
Use another attribute or element in the XML response to indicate '<' or '>'.
Use regular expression to find a sequence that starts with '<' and followed by a string, and followed by '<' of the closing tag. And replace it with some code or some value that you can interpret and replace later.
Also, you don't need to do this:
String response = decodeHtmlEntityCharacters(sb.toString());
You should be able to parse the XML after you take care of the '<' sign in text.
You can use this site for testing regular expressions.

Why not serialize your xml?, its much easier than what you are doing.
for an example:
var ser = new XmlSerializer(typeof(MyXMLObject));
using (var reader = XmlReader.Create("http.....xml"))
{
MyXMLObject _myobj = (response)ser.Deserialize(reader);
}

How to read a string stream in Java discarding illegal characters?

I have to parse a stream of bytes coming from a TCP connection that's supposed to only give me printable characters, but in reality that's not always the case. I've seen some binary zeros in there, at the start and end of some fields. I have no control over the source of the data and I need to process the "dirty" lines. If I could just filter out the invalid characters, that'd be OK. The relevant code is as such:
srvr = new ServerSocket(myport);
skt = srvr.accept();
// Tried with no encoding argument too
in = new Scanner(skt.getInputStream(), "ISO-8859-1");
in.useDelimiter("[\r\n]");
for (;;) {
String myline = in.next();
if (!myline.equals(""))
ProcessRecord(myline);
}
I get an exception at every line that has "dirt." What's a good way to filter out invalid characters while still being able to obtain the rest of the string?

You have to decode the InputStream through a CharsetDecoder that is configured to ignore errors:
//let's create a decoder for ISO-8859-1 which will just ignore invalid data
CharsetDecoder decoder=Charset.forName("ISO-8859-1").newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
//let's wrap the inputstream in a reader that uses the decoder
InputStream is=skt.getInputStream();
in = new Scanner(new InputStreamReader(is, decoder));
You can also choose CodingErrorAction.REPLACE or CodingErrorAction.REPORT to substitute a replacement character or to get an exception on a coding error.

The purest solution is to filter the InputStream (binary bytes-level I/O).
in = new Scanner(new DirtFilterInputStream(skt.getInputStream()), "Windows-1252");
public class DirtFilterInputStream extends InputStream {
private InputStream in;
public DirtFilterInputStream(InputStream in) {
this.in = in;
}
@Override
public int read() throws IOException {
int ch = in.read();
if (ch != -1) {
if (ch == 0) {
ch = read();
}
}
return ch;
}
}
(You need to override all methods, and delegate to the original stream.)
Windows-1252 is Windows Latin-1, an extension of ISO-8859-1 (Latin-1) that assigns printable characters to the range 0x80 - 0x9F.

I was completely off base. I get the "dirty" strings no problem (and NO, I have NO option to clean up the data source, it's from a closed system and I have to just grin and deal with it) but trying to store them in PostgreSQL is what gets me the exception. That means I have total freedom to clean it up before processing.
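Since the cleanup can happen on the String before it is stored, something along these lines might be enough. This is only a sketch: the class and method names are invented, and which characters count as "dirt" is an assumption (here NUL, the other C0 control characters except tab, and DEL).

```java
final class LineCleaner {
    // Hypothetical helper: drops control characters from a line and keeps
    // everything else unchanged.
    static String stripControlChars(String line) {
        StringBuilder clean = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            // Keep tabs; skip NUL, other C0 control characters and DEL
            if (ch == '\t' || (ch >= 0x20 && ch != 0x7F)) {
                clean.append(ch);
            }
        }
        return clean.toString();
    }
}
```

It would be called as ProcessRecord(LineCleaner.stripControlChars(myline)) in the loop from the question.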

UTF-8 byte[] to String

Let's suppose I have just used a BufferedInputStream to read the bytes of a UTF-8 encoded text file into a byte array. I know that I can use the following routine to convert the bytes to a string, but is there a more efficient/smarter way of doing this than just iterating through the bytes and converting each one?
public String openFileToString(byte[] _bytes)
{
String file_string = "";
for(int i = 0; i < _bytes.length; i++)
{
file_string += (char)_bytes[i];
}
return file_string;
}

Look at the constructor for String
String str = new String(bytes, StandardCharsets.UTF_8);
And if you're feeling lazy, you can use the Apache Commons IO library to convert the InputStream to a String directly:
String str = IOUtils.toString(inputStream, StandardCharsets.UTF_8);

Java String class has a built-in-constructor for converting byte array to string.
byte[] byteArray = new byte[] {87, 79, 87, 46, 46, 46};
String value = new String(byteArray, "UTF-8");

To convert utf-8 data, you can't assume a 1-1 correspondence between bytes and characters.
Try this:
String file_string = new String(bytes, "UTF-8");
(Bah. I see I'm way too slow in hitting the Post Your Answer button.)
To read an entire file as a String, do something like this:
public String openFileToString(String fileName) throws IOException
{
InputStream is = new BufferedInputStream(new FileInputStream(fileName));
try {
InputStreamReader rdr = new InputStreamReader(is, "UTF-8");
StringBuilder contents = new StringBuilder();
char[] buff = new char[4096];
int len = rdr.read(buff);
while (len >= 0) {
contents.append(buff, 0, len);
// read the next chunk; without this the loop never terminates
len = rdr.read(buff);
}
return contents.toString();
} finally {
try {
is.close();
} catch (Exception e) {
// log error in closing the file
}
}
}

You can use the String(byte[] bytes) constructor for that. See this link for details.
EDIT You also have to consider your platform's default charset, as per the Javadoc:
Constructs a new String by decoding the specified array of bytes using
the platform's default charset. The length of the new String is a
function of the charset, and hence may not be equal to the length of
the byte array. The behavior of this constructor when the given bytes
are not valid in the default charset is unspecified. The
CharsetDecoder class should be used when more control over the
decoding process is required.
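For what it's worth, here is a sketch of what "more control" via CharsetDecoder could look like: a strict decode that rejects invalid UTF-8 instead of silently substituting characters. The class and method names are made up, and wrapping the checked exception in IllegalArgumentException is just one possible choice.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Hypothetical helper: decodes UTF-8 and fails on malformed input
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid UTF-8", e);
        }
    }
}
```

The two bytes 0xC3 0xA9 decode to the single character é, which also illustrates that the String can be shorter than the byte array; a lone 0xC3 is rejected.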

You could use the methods described in this question (especially since you start off with an InputStream): Read/convert an InputStream to a String
In particular, if you don't want to rely on external libraries, you can try this answer, which reads the InputStream via an InputStreamReader into a char[] buffer and appends it into a StringBuilder.

Knowing that you are dealing with a UTF-8 byte array, you'll definitely want to use the String constructor that accepts a charset name. Otherwise you may leave yourself open to some charset encoding based security vulnerabilities. Note that it throws UnsupportedEncodingException which you'll have to handle. Something like this:
public String openFileToString(byte[] _bytes) {
String file_string;
try {
file_string = new String(_bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
// this should never happen because "UTF-8" is hard-coded.
throw new IllegalStateException(e);
}
return file_string;
}

Here's a simplified function that will read in bytes and create a string. It assumes you probably already know what encoding the file is in (and otherwise defaults).
static final int BUFF_SIZE = 2048;
static final String DEFAULT_ENCODING = "utf-8";
public static String readFileToString(String filePath, String encoding) throws IOException {
if (encoding == null || encoding.length() == 0)
encoding = DEFAULT_ENCODING;
StringBuilder content = new StringBuilder();
// Decode through a Reader: decoding each byte chunk separately would corrupt
// multi-byte characters that straddle a chunk boundary.
try (Reader reader = new InputStreamReader(new FileInputStream(filePath), encoding)) {
char[] buffer = new char[BUFF_SIZE];
int charsRead;
while ((charsRead = reader.read(buffer)) != -1)
content.append(buffer, 0, charsRead);
}
return content.toString();
}

String has a constructor that takes byte[] and charsetname as parameters :)

This also involves iterating, but it is much better than concatenating strings, which is very costly. Note that casting each byte to a char is only correct for ASCII data, not for multi-byte UTF-8 sequences.
public String openFileToString(byte[] _bytes)
{
StringBuilder s = new StringBuilder(_bytes.length);
for(int i = 0; i < _bytes.length; i++)
{
s.append((char)_bytes[i]);
}
return s.toString();
}

Why not get what you are looking for from the get go and read a string from the file instead of an array of bytes? Something like:
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream("foo.txt"), Charset.forName("UTF-8")));
then readLine from in until it's done.

I use this way
String strIn = new String(_bytes, 0, numBytes);

URLEncoder not able to translate space character

I am expecting
System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8"));
to output:
Hello%20World
(20 is ASCII Hex code for space)
However, what I get is:
Hello+World
Am I using the wrong method? What is the correct method I should be using?

This behaves as expected. The URLEncoder implements the HTML Specifications for how to encode URLs in HTML forms.
From the javadocs:
This class contains static methods for
converting a String to the
application/x-www-form-urlencoded MIME
format.
and from the HTML Specification:
application/x-www-form-urlencoded
Forms submitted with this content type
must be encoded as follows:
Control names and values are escaped. Space characters are replaced
by `+'
You will have to replace it, e.g.:
System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8").replace("+", "%20"));
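As a small sketch of why that replacement does not clobber real plus signs: URLEncoder turns a literal + in the input into %2B, so every + left in its output stands for a space. The class and method names are invented, and the Charset overload of URLEncoder.encode assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PercentEncoder {
    // Hypothetical helper: form-encodes the value, then rewrites the
    // "+" that URLEncoder emits for spaces as "%20".
    static String encodeWithPercent20(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }
}
```

PercentEncoder.encodeWithPercent20("Hello World") gives Hello%20World, and "a+b c" gives a%2Bb%20c.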

A space is encoded to %20 in URLs, and to + in forms submitted data (content type application/x-www-form-urlencoded). You need the former.
Using Guava:
dependencies {
compile 'com.google.guava:guava:23.0'
// or, for Android:
compile 'com.google.guava:guava:23.0-android'
}
You can use UrlEscapers:
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
Don't use String.replace, this would only encode the space. Use a library instead.

This class performs application/x-www-form-urlencoded-type encoding rather than percent encoding, therefore replacing the space with + is correct behaviour.
From javadoc:
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.

Encode Query params
org.apache.commons.httpclient.util.URIUtil
URIUtil.encodeQuery(input);
OR if you want to escape chars within URI
// requires: import java.nio.charset.StandardCharsets;
public static String escapeURIPathParam(String input) {
StringBuilder resultStr = new StringBuilder();
// Escape the UTF-8 bytes so that non-ASCII characters also produce valid %XX sequences
for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
int ch = b & 0xFF;
if (isUnsafe(ch)) {
resultStr.append('%');
resultStr.append(toHex(ch / 16));
resultStr.append(toHex(ch % 16));
} else{
resultStr.append((char) ch);
}
}
return resultStr.toString();
}
private static char toHex(int ch) {
return (char) (ch < 10 ? '0' + ch : 'A' + ch - 10);
}
private static boolean isUnsafe(int ch) {
// every byte outside the ASCII range has to be escaped
if (ch >= 128)
return true;
return " %$&+,/:;=?@<>#%".indexOf(ch) >= 0;
}

Hello+World is how a browser will encode form data (application/x-www-form-urlencoded) for a GET request and this is the generally accepted form for the query part of a URI.
http://host/path/?message=Hello+World
If you sent this request to a Java servlet, the servlet would correctly decode the parameter value. Usually the only time there are issues here is if the encoding doesn't match.
Strictly speaking, there is no requirement in the HTTP or URI specs that the query part be encoded as application/x-www-form-urlencoded key-value pairs; the query part just needs to be in a form the web server accepts. In practice, this is unlikely to be an issue.
It would generally be incorrect to use this encoding for other parts of the URI (the path for example). In that case, you should use the encoding scheme as described in RFC 3986.
http://host/Hello%20World
More here.

If you want to encode URI path components, you can also use standard JDK functions, e.g.
public static String encodeURLPathComponent(String path) {
try {
return new URI(null, null, path, null).toASCIIString();
} catch (URISyntaxException e) {
// do some error handling
}
return "";
}
The URI class can also be used to encode different parts of or whole URIs.

I have just been struggling with this on Android too, and stumbled upon Uri.encode(String, String). While it is specific to Android (android.net.Uri), it might be useful to some.
static String encode(String s, String allow)
https://developer.android.com/reference/android/net/Uri.html#encode(java.lang.String, java.lang.String)

The other answers present either a manual string replacement, URLEncoder (which actually encodes for the HTML form format), Apache's abandoned URIUtil, or Guava's UrlEscapers. The last one is fine, except that it doesn't provide a decoder.
Apache Commons Codec provides URLCodec, which both encodes and decodes. Note that it implements the www-form-urlencoded scheme, so like URLEncoder it encodes a space as + rather than %20.
String encoded = new URLCodec().encode(str);
String decoded = new URLCodec().decode(encoded);
If you are already using Spring, you can also opt to use its UriUtils class as well.

Although this question is quite old, here is a quick response:
Spring provides UriUtils. With it you can specify how to encode and which part of the URI the value belongs to, e.g.
encodePathSegment
encodePort
encodeFragment
encodeUriVariables
....
I use them because we are already using Spring, i.e. no additional library is required!

If you are using jetty then org.eclipse.jetty.util.URIUtil will solve the issue.
String encoded_string = URIUtil.encodePath(not_encoded_string).toString();

This worked for me
String ul = new org.apache.catalina.util.URLEncoder().encode("MY URL");

It's not a one-liner, but you can use:
URL url = new URL("https://some-host.net/dav/files/selling_Rosetta Stone Case Study.png.aes");
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
System.out.println(uri.toString());
This will give you an output:
https://some-host.net/dav/files/selling_Rosetta%20Stone%20Case%20Study.png.aes

"+" is correct. If you really need %20, then replace the Plusses yourself afterwards.
Warning: This answer is heavily disputed (+8 vs. -6), so take this with a grain of salt.

I was already using Feign, so its UriUtils was available to me, but Spring's UriUtils was not.
<!-- https://mvnrepository.com/artifact/io.github.openfeign/feign-core -->
<dependency>
<groupId>io.github.openfeign</groupId>
<artifactId>feign-core</artifactId>
<version>11.8</version>
</dependency>
My Feign test code:
import feign.template.UriUtils;
System.out.println(UriUtils.encode("Hello World"));
Outputs:
Hello%20World
As the class suggests, it encodes URIs and not URLs but the OP asked about URIs and not URLs.
System.out.println(UriUtils.encode("https://some-host.net/dav/files/selling_Rosetta Stone Case Study.png.aes"));
Outputs:
https%3A%2F%2Fsome-host.net%2Fdav%2Ffiles%2Fselling_Rosetta%20Stone%20Case%20Study.png.aes

Try below approach:
Add a new dependency
<!-- https://mvnrepository.com/artifact/org.apache.tomcat/tomcat-catalina -->
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-catalina</artifactId>
<version>10.0.13</version>
</dependency>
Now do as follows:
String str = "Hello+World"; // For "Hello World", decoder is not required
// import java.net.URLDecoder;
String newURL = URLDecoder.decode(str, StandardCharsets.UTF_8);
// import org.apache.catalina.util.URLEncoder;
System.out.println(URLEncoder.DEFAULT.encode(newURL, StandardCharsets.UTF_8));
You'll get the output as:
Hello%20World

Check out the java.net.URI class.

Use MyUrlEncode.URLencoding(String url, String enc) to handle the problem:
public class MyUrlEncode {
static BitSet dontNeedEncoding = null;
static final int caseDiff = ('a' - 'A');
static {
dontNeedEncoding = new BitSet(256);
int i;
for (i = 'a'; i <= 'z'; i++) {
dontNeedEncoding.set(i);
}
for (i = 'A'; i <= 'Z'; i++) {
dontNeedEncoding.set(i);
}
for (i = '0'; i <= '9'; i++) {
dontNeedEncoding.set(i);
}
dontNeedEncoding.set('-');
dontNeedEncoding.set('_');
dontNeedEncoding.set('.');
dontNeedEncoding.set('*');
dontNeedEncoding.set('&');
dontNeedEncoding.set('=');
}
public static String char2Unicode(char c) {
if(dontNeedEncoding.get(c)) {
return String.valueOf(c);
}
StringBuffer resultBuffer = new StringBuffer();
resultBuffer.append("%");
char ch = Character.forDigit((c >> 4) & 0xF, 16);
if (Character.isLetter(ch)) {
ch -= caseDiff;
}
resultBuffer.append(ch);
ch = Character.forDigit(c & 0xF, 16);
if (Character.isLetter(ch)) {
ch -= caseDiff;
}
resultBuffer.append(ch);
return resultBuffer.toString();
}
private static String URLEncoding(String url,String enc) throws UnsupportedEncodingException {
StringBuffer stringBuffer = new StringBuffer();
if(!dontNeedEncoding.get('/')) {
dontNeedEncoding.set('/');
}
if(!dontNeedEncoding.get(':')) {
dontNeedEncoding.set(':');
}
byte [] buff = url.getBytes(enc);
for (int i = 0; i < buff.length; i++) {
stringBuffer.append(char2Unicode((char)buff[i]));
}
return stringBuffer.toString();
}
private static String URIEncoding(String uri , String enc) throws UnsupportedEncodingException { // encode the request parameters
StringBuffer stringBuffer = new StringBuffer();
if(dontNeedEncoding.get('/')) {
dontNeedEncoding.clear('/');
}
if(dontNeedEncoding.get(':')) {
dontNeedEncoding.clear(':');
}
byte [] buff = uri.getBytes(enc);
for (int i = 0; i < buff.length; i++) {
stringBuffer.append(char2Unicode((char)buff[i]));
}
return stringBuffer.toString();
}
public static String URLencoding(String url , String enc) throws UnsupportedEncodingException {
int index = url.indexOf('?');
StringBuffer result = new StringBuffer();
if(index == -1) {
result.append(URLEncoding(url, enc));
}else {
result.append(URLEncoding(url.substring(0 , index),enc));
result.append("?");
result.append(URIEncoding(url.substring(index+1),enc));
}
return result.toString();
}
}

Am I using the wrong method? What is the correct method I should be using?
Yes, java.net.URLEncoder.encode was not made to convert " " to "%20". According to the spec (source):
The space character " " is converted into a plus sign "+".
Even though this is not the intended method, you can adapt its output: System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8").replaceAll("\\+", "%20"));

use character-set "ISO-8859-1" for URLEncoder
