I've got an oddball problem here. I've got a little Java program that filters Minecraft log files to make them easier to read. On each line of these logs, there are usually multiple instances of the character "§", which returns a hex value of FFFD.
I am filtering out this character (as well as the character following it) using:
currentLine = currentLine.replaceAll("\uFFFD.", "");
Now, when I run the program through NetBeans, it works swell. My lines get outputted looking like this:
CxndyAnnie: Mhm
CxndyAnnie: Sorry
But when I build the .jar file and wrap it into a .exe file using JSmooth, that character no longer gets filtered out when I run the .exe, and my lines come out looking like this:
§e§7[§f$65§7] §1§nCxndyAnnie§e: Mhm
§e§7[§f$65§7] §1§nCxndyAnnie§e: Sorry
(note: the additional square brackets and $65 show up because their filtering depends on the special character and its following character being removed first)
Any ideas why this would no longer work after putting it through JSmooth? Is there a different way to do the text replace that would preserve its function?
By the way, I also attempted to remove this character using
currentLine = currentLine.replaceAll("§.", "");
but that didn't work in NetBeans or as a .exe.
I'll go ahead and paste the full method below:
public static String[] filterLines(String[] allLines, String filterType, Boolean timeStamps) throws IOException {
String currentLine = null;
FileWriter saveFile = new FileWriter("readable.txt");
String heading;
String string1 = "[L]";
String string2 = "[A]";
String string3 = "[G]";
if (filterType.equals(string1)) {
heading = "LOCAL CHAT LOGS ONLY \r\n\r\n";
}
else if (filterType.equals(string2)) {
heading = "ADVERTISING CHAT LOGS ONLY \r\n\r\n";
}
else if (filterType.equals(string3)) {
heading = "GLOBAL CHAT LOGS ONLY \r\n\r\n";
}
else {
heading = "CHAT LINES CONTAINING \"" + filterType + "\" \r\n\r\n";
}
saveFile.write(heading);
for (int i = 0; i < allLines.length; i++) {
if ((allLines[i] != null ) && (allLines[i].contains(filterType))) {
currentLine = allLines[i];
if (!timeStamps) {
currentLine = currentLine.replaceAll("\\[..:..:..\\].", "");
}
currentLine = currentLine.replaceAll("\\[Client thread/INFO\\]:.", "");
currentLine = currentLine.replaceAll("\\[CHAT\\].", "");
currentLine = currentLine.replaceAll("\uFFFD.", "");
currentLine = currentLine.replaceAll("\\[A\\].", "");
currentLine = currentLine.replaceAll("\\[L\\].", "");
currentLine = currentLine.replaceAll("\\[G\\].", "");
currentLine = currentLine.replaceAll("\\[\\$..\\].", "");
currentLine = currentLine.replaceAll(".>", ":");
currentLine = currentLine.replaceAll("\\[\\$100\\].", "");
saveFile.write(currentLine + "\r\n");
//System.out.println(currentLine);
}
}
saveFile.close();
ProcessBuilder openFile = new ProcessBuilder("Notepad.exe", "readable.txt");
openFile.start();
return allLines;
}
FINAL EDIT
Just in case anyone stumbles across this and needs to know what finally worked, here's the snippet of code where I pull the lines from the file and re-encode it to work:
BufferedReader fileLines;
fileLines = new BufferedReader(new FileReader(file));
String[] allLines = new String[numLines];
int i=0;
while ((line = fileLines.readLine()) != null) {
byte[] bLine = line.getBytes();
String convLine = new String(bLine, Charset.forName("UTF-8"));
allLines[i] = convLine;
i++;
}
I also had a problem like this in the past with Minecraft logs. I don't remember the exact details, but the issue came down to the file's encoding: reading it as UTF-8 worked correctly, while other text encodings, including the system default, did not.
First:
Make sure that you specify UTF8 encoding when reading the byteArray from file so that allLines contains the correct info like so:
Path fileLocation = Paths.get("C:/myFileLocation/logs.txt");
byte[] data = Files.readAllBytes(fileLocation);
String allLines = new String(data , Charset.forName("UTF-8"));
Second:
Matching on \uFFFD is fragile, because U+FFFD is the replacement character that a decoder substitutes for input it cannot decode. It only shows up when the file has been read with the wrong encoding.
However, if you use the correct encoding (shown in my first point), \uFFFD is not necessary: § is an ordinary Unicode character, so you can simply use
currentLine = currentLine.replaceAll("§.", "");
or, so that the result does not depend on the encoding of your source file, use the Unicode escape for that character (U+00A7) like so
currentLine = currentLine.replaceAll("\u00A7.", "");
or just use both those lines in your code.
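A minimal sketch of the two points above, assuming the log really is UTF-8 encoded. The class name LogLineCleaner and its method names are made up for illustration and are not part of the original program:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, shown only to illustrate the idea.
class LogLineCleaner {

    // Decode raw file bytes as UTF-8 rather than the platform default charset.
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Remove every section sign (U+00A7) together with the one character after it.
    public static String stripFormatting(String line) {
        return line.replaceAll("\u00A7.", "");
    }

    public static void main(String[] args) {
        // The section sign is encoded in UTF-8 as the two bytes 0xC2 0xA7.
        byte[] raw = {(byte) 0xC2, (byte) 0xA7, 'e', 'H', 'i'};
        System.out.println(stripFormatting(decodeUtf8(raw)));
    }
}
```

If the file turns out to be in a different encoding (for example the Windows default), the same stripFormatting call still works as long as the bytes are decoded with that file's actual charset.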
I have a text document in which I have a bunch of urls of the form /courses/......./.../..
and from among these urls, I only want to extract those urls that are of the form /courses/.../lecture-notes. Meaning the urls that begin with /courses and ends with /lecture-notes.
Would anyone know of a good way to do this with regular expressions or just by string matching?
Here's one alternative:
Scanner s = new Scanner(new FileReader("filename.txt"));
String str;
while (null != (str = s.findWithinHorizon("/courses/\\S*/lecture-notes", 0)))
System.out.println(str);
Given a filename.txt with the content
Here /courses/lorem/lecture-notes and
here /courses/ipsum/dolor/lecture-notes perhaps.
the above snippet prints
/courses/lorem/lecture-notes
/courses/ipsum/dolor/lecture-notes
The following returns only the middle part (i.e. it excludes /courses/ and /lecture-notes):
Pattern p = Pattern.compile("/courses/(.*)/lecture-notes");
Matcher m = p.matcher(yourString);
if (m.find()) {
    return m.group(1); // group(1) is the part of the match captured by the first pair of parentheses
}
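A self-contained version of the capturing-group idea, assuming the intended suffix is /lecture-notes as in the question. The class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating a capturing group.
class CourseMiddlePart {

    private static final Pattern COURSE = Pattern.compile("/courses/(.*)/lecture-notes");

    // Returns the text between "/courses/" and "/lecture-notes",
    // or null if the input does not contain a URL of that shape.
    public static String middle(String url) {
        Matcher m = COURSE.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(middle("/courses/ipsum/dolor/lecture-notes"));
    }
}
```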
Assuming that you have 1 URL per line, you could use:
BufferedReader br = new BufferedReader(new FileReader("urls.txt"));
String urlLine;
while ((urlLine = br.readLine()) != null) {
if (urlLine.matches("/courses/.*/lecture-notes")) {
// use url
}
}
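As a rough sketch of the same check applied to a list of lines that has already been read: String.matches only succeeds when the whole line matches, so URLs with anything after /lecture-notes are rejected. The class and method names here are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps only the lines that are lecture-notes URLs.
class LectureNoteFilter {

    public static List<String> lectureNotes(List<String> urls) {
        List<String> result = new ArrayList<>();
        for (String url : urls) {
            // matches() anchors the pattern at both ends of the line.
            if (url.matches("/courses/.*/lecture-notes")) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        urls.add("/courses/a/lecture-notes");
        urls.add("/courses/a/syllabus");
        System.out.println(lectureNotes(urls));
    }
}
```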
I am expecting
System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8"));
to output:
Hello%20World
(20 is ASCII Hex code for space)
However, what I get is:
Hello+World
Am I using the wrong method? What is the correct method I should be using?
This behaves as expected. The URLEncoder implements the HTML Specifications for how to encode URLs in HTML forms.
From the javadocs:
This class contains static methods for
converting a String to the
application/x-www-form-urlencoded MIME
format.
and from the HTML Specification:
application/x-www-form-urlencoded
Forms submitted with this content type
must be encoded as follows:
Control names and values are escaped. Space characters are replaced
by `+'
You will have to replace it, e.g.:
System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8").replace("+", "%20"));
A space is encoded to %20 in URLs, and to + in forms submitted data (content type application/x-www-form-urlencoded). You need the former.
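A small sketch of the replace approach wrapped in a helper (the name encodeComponent is made up). Because URLEncoder already turns a literal "+" in the input into %2B, every "+" left in its output stands for a space, so the replacement does not corrupt real plus signs:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: form-encode, then turn the "+" for spaces into %20.
class PercentEncoder {

    public static String encodeComponent(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeComponent("Hello World"));
    }
}
```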
Using Guava:
dependencies {
compile 'com.google.guava:guava:23.0'
// or, for Android:
compile 'com.google.guava:guava:23.0-android'
}
You can use UrlEscapers:
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
Don't use String.replace, this would only encode the space. Use a library instead.
This class performs application/x-www-form-urlencoded-type encoding rather than percent encoding, therefore replacing a space with + is correct behaviour.
From javadoc:
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
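To see these rules in action, here is a short sketch (the wrapper class and method name are only for illustration) that passes a few strings through URLEncoder with UTF-8:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical wrapper around URLEncoder to try out the javadoc rules.
class FormEncodingRules {

    public static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a-b_c.d*e")); // safe characters stay as they are
        System.out.println(formEncode("Hello World")); // the space becomes "+"
        System.out.println(formEncode("a&b=c"));       // other characters become %xy
    }
}
```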
Encode Query params
org.apache.commons.httpclient.util.URIUtil
URIUtil.encodeQuery(input);
OR if you want to escape chars within URI
// requires: import java.nio.charset.StandardCharsets;
public static String escapeURIPathParam(String input) {
    StringBuilder resultStr = new StringBuilder();
    // Work on the UTF-8 bytes so that non-ASCII characters become valid %XX sequences.
    for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
        int ch = b & 0xFF;
        if (isUnsafe(ch)) {
            resultStr.append('%');
            resultStr.append(toHex(ch / 16));
            resultStr.append(toHex(ch % 16));
        } else {
            resultStr.append((char) ch);
        }
    }
    return resultStr.toString();
}
// Converts a value in the range 0..15 to an uppercase hex digit.
private static char toHex(int ch) {
    return (char) (ch < 10 ? '0' + ch : 'A' + ch - 10);
}
private static boolean isUnsafe(int ch) {
    if (ch >= 128)
        return true;
    return " %$&+,/:;=?#<>".indexOf(ch) >= 0;
}
Hello+World is how a browser will encode form data (application/x-www-form-urlencoded) for a GET request and this is the generally accepted form for the query part of a URI.
http://host/path/?message=Hello+World
If you sent this request to a Java servlet, the servlet would correctly decode the parameter value. Usually the only time there are issues here is if the encoding doesn't match.
Strictly speaking, there is no requirement in the HTTP or URI specs that the query part be encoded using application/x-www-form-urlencoded key-value pairs; the query part just needs to be in the form the web server accepts. In practice, this is unlikely to be an issue.
It would generally be incorrect to use this encoding for other parts of the URI (the path for example). In that case, you should use the encoding scheme as described in RFC 3986.
http://host/Hello%20World
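One way to get RFC-style percent encoding for both parts with only the JDK is the multi-argument java.net.URI constructor, which quotes illegal characters such as spaces in each component. This is a sketch with made-up names and a placeholder host, not a complete URL builder:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: lets java.net.URI quote the path and the query.
class UriParts {

    public static String build(String path, String query) {
        try {
            // Arguments: scheme, authority, path, query, fragment.
            return new URI("http", "host", path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/Hello World", "message=Hello World"));
    }
}
```

Note that this constructor leaves characters that are legal in a query, such as "&", "=" and "+", untouched, so individual parameter values containing them still need to be encoded separately.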
If you want to encode URI path components, you can also use standard JDK functions, e.g.
public static String encodeURLPathComponent(String path) {
try {
return new URI(null, null, path, null).toASCIIString();
} catch (URISyntaxException e) {
// do some error handling
}
return "";
}
The URI class can also be used to encode different parts of or whole URIs.
I've just been struggling with this on Android too and stumbled upon Uri.encode(String, String). It is specific to Android (android.net.Uri), but it might be useful to some.
static String encode(String s, String allow)
https://developer.android.com/reference/android/net/Uri.html#encode(java.lang.String, java.lang.String)
The other answers either present a manual string replacement, URLEncoder which actually encodes for HTML format, Apache's abandoned URIUtil, or using Guava's UrlEscapers. The last one is fine, except it doesn't provide a decoder.
Apache Commons Codec provides URLCodec, which both encodes and decodes. Note that it implements the www-form-urlencoded scheme, so a space is encoded as "+", just as with URLEncoder.
String encoded = new URLCodec().encode(str);
String decoded = new URLCodec().decode(encoded);
If you are already using Spring, you can also opt to use its UriUtils class as well.
Although quite old, nevertheless a quick response:
Spring provides UriUtils. With it you can specify how to encode and which part of the URI the value belongs to, e.g.
encodePathSegment
encodePort
encodeFragment
encodeUriVariables
....
I use them because we are already using Spring, i.e. no additional library is required!
If you are using jetty then org.eclipse.jetty.util.URIUtil will solve the issue.
String encoded_string = URIUtil.encodePath(not_encoded_string).toString();
This worked for me
String encoded = new org.apache.catalina.util.URLEncoder().encode("MY URL");
It's not a one-liner, but you can use:
URL url = new URL("https://some-host.net/dav/files/selling_Rosetta Stone Case Study.png.aes");
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
System.out.println(uri.toString());
This will give you an output:
https://some-host.net/dav/files/selling_Rosetta%20Stone%20Case%20Study.png.aes
"+" is correct. If you really need %20, then replace the Plusses yourself afterwards.
Warning: This answer is heavily disputed (+8 vs. -6), so take this with a grain of salt.
I was already using Feign, so its UriUtils was available to me, but Spring's UriUtils was not.
<!-- https://mvnrepository.com/artifact/io.github.openfeign/feign-core -->
<dependency>
<groupId>io.github.openfeign</groupId>
<artifactId>feign-core</artifactId>
<version>11.8</version>
</dependency>
My Feign test code:
import feign.template.UriUtils;
System.out.println(UriUtils.encode("Hello World"));
Outputs:
Hello%20World
As the class name suggests, it encodes URIs and not URLs, but the OP asked about URIs and not URLs.
System.out.println(UriUtils.encode("https://some-host.net/dav/files/selling_Rosetta Stone Case Study.png.aes"));
Outputs:
https%3A%2F%2Fsome-host.net%2Fdav%2Ffiles%2Fselling_Rosetta%20Stone%20Case%20Study.png.aes
Try the approach below:
Add a new dependency
<!-- https://mvnrepository.com/artifact/org.apache.tomcat/tomcat-catalina -->
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-catalina</artifactId>
<version>10.0.13</version>
</dependency>
Now do as follows:
String str = "Hello+World"; // For "Hello World", decoder is not required
// import java.net.URLDecoder;
String newURL = URLDecoder.decode(str, StandardCharsets.UTF_8);
// import org.apache.catalina.util.URLEncoder;
System.out.println(URLEncoder.DEFAULT.encode(newURL, StandardCharsets.UTF_8));
You'll get the output as:
Hello%20World
Check out the java.net.URI class.
Use MyUrlEncode.URLencoding(String url, String enc) to handle the problem:
public class MyUrlEncode {
static BitSet dontNeedEncoding = null;
static final int caseDiff = ('a' - 'A');
static {
dontNeedEncoding = new BitSet(256);
int i;
for (i = 'a'; i <= 'z'; i++) {
dontNeedEncoding.set(i);
}
for (i = 'A'; i <= 'Z'; i++) {
dontNeedEncoding.set(i);
}
for (i = '0'; i <= '9'; i++) {
dontNeedEncoding.set(i);
}
dontNeedEncoding.set('-');
dontNeedEncoding.set('_');
dontNeedEncoding.set('.');
dontNeedEncoding.set('*');
dontNeedEncoding.set('&');
dontNeedEncoding.set('=');
}
public static String char2Unicode(char c) {
if(dontNeedEncoding.get(c)) {
return String.valueOf(c);
}
StringBuffer resultBuffer = new StringBuffer();
resultBuffer.append("%");
char ch = Character.forDigit((c >> 4) & 0xF, 16);
if (Character.isLetter(ch)) {
ch -= caseDiff;
}
resultBuffer.append(ch);
ch = Character.forDigit(c & 0xF, 16);
if (Character.isLetter(ch)) {
ch -= caseDiff;
}
resultBuffer.append(ch);
return resultBuffer.toString();
}
private static String URLEncoding(String url,String enc) throws UnsupportedEncodingException {
StringBuffer stringBuffer = new StringBuffer();
if(!dontNeedEncoding.get('/')) {
dontNeedEncoding.set('/');
}
if(!dontNeedEncoding.get(':')) {
dontNeedEncoding.set(':');
}
byte [] buff = url.getBytes(enc);
for (int i = 0; i < buff.length; i++) {
stringBuffer.append(char2Unicode((char)buff[i]));
}
return stringBuffer.toString();
}
private static String URIEncoding(String uri , String enc) throws UnsupportedEncodingException { // encode the request parameters
StringBuffer stringBuffer = new StringBuffer();
if(dontNeedEncoding.get('/')) {
dontNeedEncoding.clear('/');
}
if(dontNeedEncoding.get(':')) {
dontNeedEncoding.clear(':');
}
byte [] buff = uri.getBytes(enc);
for (int i = 0; i < buff.length; i++) {
stringBuffer.append(char2Unicode((char)buff[i]));
}
return stringBuffer.toString();
}
public static String URLencoding(String url , String enc) throws UnsupportedEncodingException {
int index = url.indexOf('?');
StringBuffer result = new StringBuffer();
if(index == -1) {
result.append(URLEncoding(url, enc));
}else {
result.append(URLEncoding(url.substring(0 , index),enc));
result.append("?");
result.append(URIEncoding(url.substring(index+1),enc));
}
return result.toString();
}
}
Am I using the wrong method? What is the correct method I should be using?
Yes, java.net.URLEncoder.encode was not designed to convert " " to "%20". According to the spec (source):
The space character " " is converted into a plus sign "+".
Even though this is not the ideal method for the job, you can adapt its output: System.out.println(java.net.URLEncoder.encode("Hello World", "UTF-8").replaceAll("\\+", "%20"));
Use the character set "ISO-8859-1" for URLEncoder.
I want to make an HTTP GET request from my J2ME application using the HttpConnection class.
The problem is that I cannot send Russian text in the query string.
Here is the example of how I'm sending the request
c = (HttpConnection)Connector.open("http://127.0.0.1:1418/zp.ashx?тест");
InputStream s = c.openInputStream();
The receiving asp.net script receives the query part of the url as %3f%3f%3f%3f
That is 4 identical codes. Definitely that's not what I'm sending.
So how can I send non-latin text in an http query in J2ME?
Thank you in advance
Your code
Connector.open("http://127.0.0.1:1418/zp.ashx?тест");
is effectively encoded as ASCII, and the encoder replaces every character it cannot represent with its replacement character "?" (0x3F). That is why the server sees %3f four times.
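The effect can be reproduced on desktop Java. This is only an illustration with a made-up class name, not J2ME code: encoding the Cyrillic text as US-ASCII replaces each unmappable character with "?".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of what an ASCII-only encoding does to Cyrillic text.
class AsciiReplacementDemo {

    // Encodes the text as US-ASCII and decodes it again.
    public static String roundTripThroughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The four escapes spell the Russian word from the question.
        System.out.println(roundTripThroughAscii("\u0442\u0435\u0441\u0442"));
    }
}
```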
To get the behavior you want, you have to encode the URL before sending it. For example, when your server expects the URLs to be UTF8-encoded:
String encodedParameter = URLEncoder.encode("тест", "UTF-8");
Connector.open("http://127.0.0.1:1418/zp.ashx?" + encodedParameter);
Note that if you have multiple parameters, you have to encode both the parameter names and the parameter values individually, before putting them together with "=" and concatenating them with "&". If you need to encode multiple parameters, this class may be helpful to you:
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
public class UrlParamGenerator {
private final String encoding;
private final StringBuilder sb = new StringBuilder();
private String separator = "?";
public UrlParamGenerator(String charset) {
this.encoding = charset;
}
public void add(String key, String value) throws UnsupportedEncodingException {
sb.append(separator);
sb.append(URLEncoder.encode(key, encoding));
sb.append("=");
sb.append(URLEncoder.encode(value, encoding));
separator = "&";
}
@Override
public String toString() {
return sb.toString();
}
public static void main(String[] args) throws UnsupportedEncodingException {
UrlParamGenerator gen = new UrlParamGenerator("UTF-8");
gen.add("test", "\u0442\u0435\u0441\u0442");
gen.add("x", "0");
System.out.println(gen.toString());
}
}
You might need to explicitly set a character set in the HTTP header that supports the Cyrillic alphabet. You could either use UTF-8 or another charset, such as windows-1251 (although UTF-8 should be the preferred choice). The header has to be set after the connection has been opened:
c = (HttpConnection)Connector.open("http://127.0.0.1:1418/zp.ashx?тест");
c.setRequestProperty("Content-type", "application/x-www-form-urlencoded;charset=utf-8");
If you use an appropriate charset, the server should be able to properly handle the Cyrillic request parameter, provided it too supports this charset.
A URL can contain only ASCII letters and digits and a few punctuation characters. All other characters must be %-encoded before you add them to the URL. Use URLEncoder.encode("тест", enc) where the enc parameter is the encoding scheme that the server expects.
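A sketch of that encoding step on desktop Java, assuming the server expects UTF-8. The class and method names are invented, and java.net.URLEncoder may not be available on a J2ME profile, in which case an equivalent encoder has to be supplied:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: appends a %-encoded query string to a base URL.
class CyrillicQuery {

    public static String buildUrl(String base, String rawQuery) {
        try {
            return base + "?" + URLEncoder.encode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java SE platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The four escapes spell the Russian word from the question.
        System.out.println(buildUrl("http://127.0.0.1:1418/zp.ashx", "\u0442\u0435\u0441\u0442"));
    }
}
```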
Some characters are not supported by certain charsets, so the test below fails. I would like to use HTML entities to encode ONLY those unsupported characters. How can I do this in Java?
public void testWriter() throws IOException{
String c = "\u00A9";
String encoding = "gb2312";
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
Writer writer = new BufferedWriter(new OutputStreamWriter(outStream, encoding));
writer.write(c);
writer.close();
String result = new String(outStream.toByteArray(), encoding);
assertEquals(c, result);
}
I'm not positive I understand the question, but something like this might help:
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
...
StringBuilder buf = new StringBuilder(c.length());
CharsetEncoder enc = Charset.forName("gb2312").newEncoder();
for (int idx = 0; idx < c.length(); ++idx) {
char ch = c.charAt(idx);
if (enc.canEncode(ch))
buf.append(ch);
else {
buf.append("&#");
buf.append((int) ch);
buf.append(';');
}
}
String result = buf.toString();
This code is not robust, because it doesn't handle characters beyond the Basic Multilingual Plane. But iterating over code points in the String, and using the canEncode(CharSequence) method of the CharsetEncoder, you should be able to handle any character.
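A possible code-point-based variant, sketched under the assumption that decimal numeric entities are acceptable. The class and method names are hypothetical, and the charset is passed in by name so that any charset installed in the JVM can be used:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper: replaces characters the charset cannot encode with numeric entities.
class EntityEscaper {

    public static String escapeUnsupported(String text, String charsetName) {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        StringBuilder buf = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int len = Character.charCount(codePoint); // 2 for characters outside the BMP
            String unit = text.substring(i, i + len);
            if (enc.canEncode(unit)) {
                buf.append(unit);
            } else {
                buf.append("&#").append(codePoint).append(';');
            }
            i += len;
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // The copyright sign is not part of US-ASCII, so it is written as an entity.
        System.out.println(escapeUnsupported("a\u00A9b", "US-ASCII"));
    }
}
```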
Try using StringEscapeUtils from Apache Commons.
Just use utf-8, and that way there is no reason to use entities.
If there is an argument that some clients need gb2312 because they don't understand Unicode, then entities are not much use either, because the numeric entities represent Unicode code points.