My Java program gets some weather information from an API, but the text contains weird letters. They look like ASCII codes.
Here is an example:
Min temp: 0°C (32°F) which should be: Min temp: 0C (32F) (i think).
How can I change it?
One solution is to strip the degree symbol before posting the text:
String withoutDegSymbol = str.replaceAll("°", "");
where str contains your temperature data.
Try this:
String s = "0°C (32°F)".replaceAll("[\u0080-\u00FF]", "");
or, if you have HTML character references in your text, use:
String s = "Min temp: 0&#xb0;C (32&#xb0;F)".replaceAll("&#x.+?;", "");
If your source code contains non-ASCII characters, did your IDE ask which encoding to use when you saved the file? Eclipse, for example, prompts you to save the file as UTF-8 when it contains characters that the current encoding cannot represent. Hope this helps.
You need to know that &#xb0; is the HTML entity (hex) encoding of the Unicode character 'DEGREE SIGN' (U+00B0).
If this is the only problem you have (I mean the degree sign is the only such character you use), you can convert it manually like this:
String s = "Min temp: 0&#xb0;C (32&#xb0;F)".replaceAll("&#xb0;", "°");
System.out.println(s);
If this is not the only special character you have, you can use the StringEscapeUtils class or the jsoup library to convert the references to Java characters.
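For the general case, a minimal sketch that decodes numeric character references using only the standard library could look like the following. The class and method names are hypothetical, and it handles only numeric references such as &#xb0; or &#176;, not named entities like &deg;; for those, one of the libraries mentioned above is probably the safer route.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class NumericReferenceDecoder {
    // Matches hex (&#xb0;) and decimal (&#176;) numeric character references.
    private static final Pattern REFERENCE =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String text) {
        Matcher m = REFERENCE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            out.append(text, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // leave invalid references untouched
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Min temp: 0&#xb0;C (32&#xb0;F)"));
    }
}
```

Unlike the replaceAll calls above, this keeps the degree sign instead of deleting it, and it does not need one replacement per character.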
Related
I get this string when I read content over JMSQ. When I print it, I see the following line. I believe those are vertical tab characters in the XML. How can I get rid of them?
#011#011#011<xeh:eid>dljfl</xeh:eid>
I have tried
replaceAll("[\\x0B]", "");
but it's not working.
Just do this:
String a = "#011#011#011<xeh:eid>dljfl</xeh:eid>";
String a_wo_vt_chars = a.replaceAll("#011", "");
"#011#011#011<xeh:eid>dljfl</xeh:eid>".replaceAll("#011", "") works fine, results in <xeh:eid>dljfl</xeh:eid>
According to the Pattern javadoc, \xhh stands for "the character with hexadecimal value 0xhh". But I guess in your string literal, #011 is just literal characters.
If I try to replicate the vertical tab in a string literal, it works with \\x0B:
"\u000b\u000b\u000b<xeh:eid>dljfl</xeh:eid>".replaceAll("\\x0B", "")
But maybe we are reading it wrong. While #0B is 11, #11 might be 17...
If #011 represents the hex value of the character, you can use
a = a.replaceAll("\\u0011", "");
// or
a = a.replaceAll("\\x11", "");
But if #011 represents the octal value (decimal 9, a horizontal tab), then use
a = a.replaceAll("\\011", "");
Also see Unicode Regular Expressions
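The interpretations discussed in these answers can be compared side by side. The sketch below is only an illustration with hypothetical class and method names; which variant applies depends on whether the string really contains the characters "#011" or an actual control character.

```java
// Hypothetical helper for illustration; not part of any library.
class TabMarkerStripper {
    // Case 1: the text literally contains the four characters '#', '0', '1', '1'.
    static String stripLiteralMarker(String s) {
        return s.replace("#011", ""); // replace() takes plain text, not a regex
    }

    // Case 2: the text contains real horizontal tabs (octal 011 = decimal 9).
    static String stripHorizontalTab(String s) {
        return s.replaceAll("\\011", "");
    }

    // Case 3: the text contains real vertical tabs (hex 0B = decimal 11).
    static String stripVerticalTab(String s) {
        return s.replaceAll("\\x0B", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLiteralMarker("#011#011#011<xeh:eid>dljfl</xeh:eid>"));
    }
}
```

Because strings are immutable, each method returns a new string; the original value is left unchanged.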
I refer to the following web site:
http://coderstoolbox.net/string/#!encoding=xml&action=encode&charset=us_ascii
Choosing "URL", "Encode", and "US-ASCII", the input is converted to the desired output.
How do I produce the same output with Java code?
Thanks in advance.
I used this and it seems to work fine.
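// needs: import java.util.regex.Pattern; import java.util.stream.Collectors;
public static String encode(String input) {
    // Characters that are kept as-is; everything else is escaped.
    Pattern doNotReplace = Pattern.compile("[a-zA-Z0-9]");
    return input.chars().mapToObj(c -> {
        String ch = String.valueOf((char) c);
        if (doNotReplace.matcher(ch).matches()) {
            return ch; // keep the original case of letters
        }
        // Two upper-case hex digits below 256; "%U" plus hex digits above.
        return c < 256 ? String.format("%%%02X", c) : String.format("%%U%X", c);
    }).collect(Collectors.joining());
}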
PS: I'm using 256 to limit the U prefix to characters that do not fit in a single byte. Characters below 256 (ASCII and the rest of Latin-1) do not need the prefix.
Alternate option:
There is a built-in Java class (java.net.URLEncoder) that does URL encoding, but it works a little differently. For example, it does not replace the space character with %20 but with a +, and something similar happens with other characters too. See if it helps:
String encoded = URLEncoder.encode(input, "US-ASCII");
Hope this helps!
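If %20 is needed instead of +, one common workaround is to post-process the URLEncoder result. The sketch below is an assumption about what the asker needs rather than a confirmed match for the web site's output; the class and method names are hypothetical, and UTF-8 is used instead of US-ASCII so that non-ASCII characters are not turned into question marks.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of any library.
class UrlEncodeSketch {
    static String percentEncode(String input) {
        try {
            // URLEncoder writes spaces as '+'; a literal '+' in the input is
            // already escaped as %2B, so replacing '+' afterwards is safe.
            return URLEncoder.encode(input, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("a b+c")); // prints a%20b%2Bc
    }
}
```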
You can use ESAPI.encoder().encodeForURL(linkString) from the OWASP ESAPI library.
For more details on percent-encoding, see https://en.wikipedia.org/wiki/Percent-encoding
Please comment if that does not satisfy your requirement or if you face any other issue.
Thanks
I'm trying to use the extended ASCII character 179 (looks like a pipe).
Here is how I use it:
String cmd = "";
char pipe = (char) 179;
// cmd = "02|CO|0|101|03|0F"
cmd ="02"+pipe+"CO"+pipe+"0"+pipe+"101"+pipe+"03"+pipe+"0F";
System.out.println("cmd "+cmd);
Output
cmd 02³CO³0³101³03³0F
But the output looks like this. I have read that extended ASCII characters are not displayed correctly.
Is my code correct and the character is just not displayed correctly, or is my code wrong?
I'm not concerned about showing this string to the user; I need to send it to a server.
EDIT
The vendor's API document states that we need to use ASCII 179 (looks like a pipe). The server-side code needs 179 (part of extended ASCII) as the pipe/vertical line, so I cannot use 124 (pipe).
EDIT 2
Here is the table for extended ASCII.
On the other hand, this table shows that ASCII 179 is "3". Why are there different interpretations of the same value, and which one should I consider?
EDIT 3
My default charset value is (is this related to my problem?)
System.out.println("Default Charset=" + Charset.defaultCharset());
Default Charset=windows-1252
Thanks!
I have referred to
How to convert a char to a String?
How to print the extended ASCII code in java from integer value
Thanks
Use the below code.
String cmd = "";
char pipe = '\u2502';
cmd ="02"+pipe+"CO"+pipe+"0"+pipe+"101"+pipe+"03"+pipe+"0F";
System.out.println("cmd "+cmd);
System.out.println("int value: " + (int)pipe);
Output:
cmd 02│CO│0│101│03│0F
int value: 9474
I am using IntelliJ. This is the output I am getting.
Your code is correct; concatenating String values and char values does what one expects. It's the value of 179 that is wrong. You can google "unicode 179", and you'll find "Unicode Character 'SUPERSCRIPT THREE' (U+00B3)", as one might expect. And, you could simply say "char pipe = '|';" instead of using an integer. Or even better: String pipe = "|"; which also allows you the flexibility to use more than one character :)
In response to the new edits...
May I suggest that you fix this rather low-level problem not at the Java String level, but instead by replacing the byte that encodes this character before sending the bytes to the server?
E.g. build cmd with a plain '|' and do something like this (untested):
// needs: import java.nio.charset.StandardCharsets;
byte[] bytes = cmd.getBytes(StandardCharsets.US_ASCII); // cmd is all ASCII, so this is safe
for (int i = 0; i < bytes.length; i++) {
    if (bytes[i] == '|') {
        bytes[i] = (byte) 179; // the value the vendor's protocol expects
    }
}
// send command bytes to server
// don't forget endline bytes/chars or whatever the protocol might require. good luck :)
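The two tables in the question most likely describe different code pages: in code page 437 the byte value 179 is drawn as a vertical line, while in windows-1252 and ISO-8859-1 the same byte is the superscript three. What the server receives is the byte, not the glyph. A minimal sketch of producing that byte, assuming the server expects one byte per character (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the vendor's API.
class Ascii179Sketch {
    // Joins the fields with char 179 and encodes one byte per character.
    static byte[] encodeCommand(String... fields) {
        String joined = String.join(String.valueOf((char) 179), fields);
        // ISO-8859-1 maps chars 0..255 directly to the byte with the same value.
        return joined.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeCommand("02", "CO", "0", "101", "03", "0F");
        // Java bytes are signed, so mask to see the unsigned value.
        System.out.println(bytes[2] & 0xFF); // prints 179
    }
}
```

Naming the charset explicitly avoids depending on the platform default, which was windows-1252 in the question.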
I'm having trouble concatenating pieces of text that mix Western and Arabic characters.
I have a list of tokens like this:
-LRB-
دریای
مازندران
-RRB-
,
I use the following procedure to concatenate this list of tokens:
String str = "";
for (String tok : tokens) {
str += tok + " ";
}
This is the output of my procedure:
-LRB- دریای مازندران -RRB- ,
As can be seen, the position of the Arabic words is inverted.
How can I solve this (maybe suggesting to Java to ignore the information about text direction)?
EDIT
Actually, it seems that my problem was a false problem.
Now I've a new one. I need to wrap each word inside a string like this (word *) so that my output will be like this:
(word1 *)(word2 *)(word3 *)...
The procedure that I use is the following:
String str = "";
for (String tok : tokens) {
str += "(" + tok + " *)";
}
However, the result that I got is this:
(-LRB- *)(دریای *)(مازندران *)(-RRB- *)(, *)
instead of:
(-LRB- *)(دریای)(* مازندران *)(-RRB- *)(, *)
EDIT 2
Actually, I've discovered that my problem is not a problem. I wrote my string on a file and I opened it with nano (in the console). And it was correctly concatenated.
So the problem was due to the Eclipse console (and also gedit) which --let's say-- incorrectly rendered the string.
Anyway, thanks for your help!
The output is correct, and if you are presenting this text to an Arabic-speaking user you should not override the directionality of the text. Arabic is written from right to left. When you concatenate two Arabic strings together, the first will appear to the right of the second. This is controlled by the BiDi algorithm, the details of which are covered in http://www.unicode.org/reports/tr9/.
First, I would suggest using StringBuilder instead of raw String concatenation. You will make your garbage collector a lot happier. Second, without seeing the input or how your StringTokenizer is set up, I would guess that you are having problems tokenizing the string properly.
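A minimal sketch of the StringBuilder suggestion, applied to the "(word *)" wrapping from the question. The class and method names are hypothetical. The characters are stored in logical order, so the order of the tokens can be checked in code regardless of how a console or editor renders right-to-left text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration.
class TokenWrapper {
    // Wraps every token as "(token *)" and concatenates the results.
    static String wrap(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String tok : tokens) {
            sb.append('(').append(tok).append(" *)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample right-to-left word written with Unicode escapes (an assumption,
        // not the asker's exact data) to keep the source file plain ASCII.
        String rtlWord = "\u0628\u062D\u0631";
        String result = wrap(Arrays.asList("-LRB-", rtlWord, "-RRB-"));
        // Logical order is preserved even if the display looks reordered.
        System.out.println(result.indexOf("-LRB-") < result.indexOf(rtlWord));
        System.out.println(result.indexOf(rtlWord) < result.indexOf("-RRB-"));
    }
}
```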
I have a text file with 1000 lines in the following format:
19 x 75 Bullnose Architrave/Skirting £1.02
I am writing a method that reads the file in line by line. This works OK.
I then want to split each string using "£" as a delimiter and write it out to
an ArrayList<String> in the following format:
19 x 75 Bullnose Architrave/Skirting, Metre, 1.02
This is how I have approached it (productList is the ArrayList, declared/instantiated outside the try block):
try{
br = new BufferedReader(new FileReader(aFile));
String inputLine = br.readLine();
String delim = "£";
while (inputLine != null){
String[]halved = inputLine.split(delim, 2);
String lineOut = halved[0] + ", Metre, " + halved[1];//Array out of bounds
productList.add(lineOut);
inputLine = br.readLine();
}
}
The String is not splitting and I keep getting an ArrayIndexOutOfBoundsException. I'm not very familiar with regex. I've also tried using the old StringTokenizer but get the same result.
Is there an issue with £ as a delim or is it something else? I did wonder if it is something to do with the second token not being read as a String?
Any ideas would be helpful.
Here are some of the possible causes:
The encoding of the file doesn't match the encoding that you are using to read it, and the "pound" character in the file is getting "mangled" into something else.
The file and your source code are using different pound-like characters. For instance, Unicode has two code points that look like a "pound sign": the Pound Sterling character (U+00A3) and the Lira character (U+20A4) ... then there is the Roman semuncia character (U+10192).
You are trying to compile a UTF-8 encoded source file without telling the compiler that it is UTF-8 encoded.
Judging from your comments, this is an encoding mismatch problem; i.e. the "default" encoding being used by Java doesn't match the actual encoding of the file. There are two ways to address this:
Change the encoding of the file to match Java's default encoding. You seem to have tried that and failed. (And it wouldn't be the way I'd do this ...)
Change the program to open the file with a specific (non-default) encoding; e.g. change
new FileReader(aFile)
to
new InputStreamReader(new FileInputStream(aFile), encoding)
where encoding is the name of the file's actual character encoding. (On Java 11 and later, new FileReader(aFile, Charset.forName(encoding)) also works.) The names of the encodings understood by Java are listed here, but my guess is that it is "ISO-8859-1" (aka Latin-1).
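Putting the second option together, a sketch of the read-and-split loop with an explicit charset might look like this. The class and method names are hypothetical, and the charset to pass in depends on how the file was actually saved. Writing the delimiter as the escape \u00A3 also keeps it independent of the encoding of the source file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration.
class PriceListReader {
    // POUND SIGN written as an escape so the source file encoding does not matter.
    private static final String POUND = "\u00A3";

    // Turns "description £price" into "description, Metre, price".
    static String formatLine(String line) {
        String[] halved = line.split(POUND, 2);
        if (halved.length < 2) {
            // Report the bad line instead of failing with an array index error.
            throw new IllegalArgumentException("No pound sign found in: " + line);
        }
        return halved[0].trim() + ", Metre, " + halved[1].trim();
    }

    // Reads the file with the given charset, e.g. StandardCharsets.ISO_8859_1.
    static List<String> readProducts(Path file, Charset charset) throws IOException {
        List<String> products = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = br.readLine()) != null) {
                products.add(formatLine(line));
            }
        }
        return products;
    }
}
```

Checking the array length before indexing turns the ArrayIndexOutOfBoundsException into a message that names the offending line, which makes an encoding mismatch easier to spot.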
This is probably a case of encoding mismatch. To check for this,
Print delim.length() and make sure it is 1.
Print inputLine.length() and make sure it is the right value (42).
If one of them is not the expected value then you have to make sure you are using UTF-8 everywhere.
You say delim.length() is 1, so this is good. On the other hand, if inputLine.length() is 34, this is very wrong. For "19 x 75 Bullnose Architrave/Skirting £1.02" you should get 42 if all was as expected. If your file was UTF-8 encoded but read as ISO-8859-1 or similar, you would have gotten 43.
Now I am a little at a loss. To debug this you could print individually each character of the string and check what is wrong with them.
for (int i = 0; i < inputLine.length(); i++) {
    System.err.println("debug: " + i + ": " + inputLine.charAt(i) + " (" + inputLine.codePointAt(i) + ")");
}
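The 42 versus 43 figures can be reproduced without any file by encoding the sample line as UTF-8 and decoding the bytes as ISO-8859-1. This is a small sketch with hypothetical names, intended only to illustrate the mismatch described above:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
class EncodingLengthCheck {
    // Simulates reading UTF-8 bytes with the wrong (ISO-8859-1) decoder.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String line = "19 x 75 Bullnose Architrave/Skirting \u00A31.02";
        System.out.println(line.length());          // prints 42
        System.out.println(misread(line).length()); // prints 43
    }
}
```

The extra character appears because the pound sign takes two bytes in UTF-8, and ISO-8859-1 decodes each byte as a separate character.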
Many thanks for all your replies.
Specifying the encoding when reading the file and saving the original text file as UTF-8 has worked.
However, the experience has taught me that delimiting text using "£" or indeed other characters that may have multiple representations in different encodings is a poor strategy.
I have decided to take a different approach:
1) Find the last space in the input string and replace it with "xxx" or similar.
2) Split this using the delimiter "xxx." (as a regex, the trailing "." matches any single character), which should split the string and strip out the "£".
3) Carry on.
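A minimal sketch of these steps, with hypothetical names. It assumes that the price is always the last space-separated field, that the marker "xxx" does not occur elsewhere in the line, and that the currency symbol arrives as exactly one character.

```java
// Hypothetical helper for illustration.
class LastSpaceSplitter {
    static String[] splitOnLastSpace(String line) {
        int lastSpace = line.lastIndexOf(' ');
        if (lastSpace < 0) {
            throw new IllegalArgumentException("No space found in: " + line);
        }
        // Step 1: replace the last space with the marker.
        String marked = line.substring(0, lastSpace) + "xxx" + line.substring(lastSpace + 1);
        // Step 2: "xxx." is a regex; the '.' consumes the one character after the marker.
        return marked.split("xxx.", 2);
    }

    public static void main(String[] args) {
        String[] parts = splitOnLastSpace("19 x 75 Bullnose Architrave/Skirting \u00A31.02");
        System.out.println(parts[0] + ", Metre, " + parts[1]);
    }
}
```

If the file is read with the wrong encoding and the symbol arrives as two characters, the "." removes only the first of them, so reading with the correct encoding still matters.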