How to read special characters from file system in libgdx - java

String jsonData = Gdx.files.internal("data/" + spreadsheet + ".json").readString();
When I try to print this String, characters such as ö, ä, and ü show up as other characters that do not resemble the originals; for example, ü shows up as √º.
How can I remedy this?
I want to serialise this later into a class instance.
Should I use some other method instead of readString?
Something else I tried: I passed the FileHandle itself to the Json object to deserialise, but the characters still show up as other characters.

I specified the charset, and the problem is now solved:
....readString("UTF-8");
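Why the explicit charset helps can be illustrated without libGDX. The sketch below uses only the JDK; the class and method names are hypothetical, and ISO-8859-1 merely stands in for whatever platform default was used in the question (the √º seen there is what the two UTF-8 bytes of ü look like when decoded as Mac OS Roman).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadDemo {
    // Decodes raw file bytes with the given charset, as a reader with an explicit encoding would.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "ü" (U+00FC) stored as UTF-8 occupies two bytes: 0xC3 0xBC.
        byte[] fileBytes = "\u00fc".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the two bytes become one character again.
        System.out.println(decode(fileBytes, StandardCharsets.UTF_8).length());      // 1
        // Wrong single-byte charset: each byte becomes its own character.
        System.out.println(decode(fileBytes, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

The file on disk is the same in both cases; only the charset chosen at read time differs.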

Related

How to handle U+FFFC (object replacement character) in URL

I am using Jsoup to scrape URLs, and one of the URLs I keep getting has the U+FFFC symbol in it. I have tried decoding the URL:
url = URLDecoder.decode(url, "UTF-8" );
but it still remains in the code looking like this:
I can't find much online about this other than that it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."
But if this is the case, I should be able to print the symbol as plain text. However, when I run
System.out.println("");
I get the following compilation error:
and it reverts back to the last save.
Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/
NOTE: If you decode the url then compare it to the decoded url it comes back as not the same e.g.:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
System.out.println("The same");
}else {
System.out.println("Not the same");
}
That's not a compilation error. That's the Eclipse code editor telling you it can't save the source code to a file, because you have told it to save the file in the cp1252 encoding, but that encoding can't express U+FFFC.
Put differently, your development environment is currently configured to store source code in the cp1252 encoding, which doesn't support the character you want. You can either configure your development environment to store source code using a more flexible encoding (such as UTF-8, as the error message suggests), or avoid having that character in your source code, for instance by using its Unicode escape sequence instead:
System.out.println("\ufffc");
Note that as far as the Java language and runtime are concerned, U+FFFC is a character like any other, so there may not be a particular need to "handle" it. Also, I am unsure why you'd expect URLDecoder to do anything if the URL hasn't been URL-encoded to begin with.
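To illustrate the point about the escape sequence, here is a minimal sketch (the class and method names are hypothetical) showing that decoding the sample URL from the question yields U+FFFC, and that a comparison succeeds once the character is written as \ufffc instead of a pasted literal or a question mark.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ObjectReplacementDemo {
    // U+FFFC written as an escape, so the source file stays pure ASCII.
    static final String OBJ = "\ufffc";

    public static String decodeUtf8(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String decoded = decodeUtf8(
            "https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/");

        // The percent-encoded bytes EF BF BC are the UTF-8 form of U+FFFC.
        System.out.println(decoded.contains("contract-roles" + OBJ + "/")); // true
        // A literal question mark is a different character, so this does not match.
        System.out.println(decoded.contains("contract-roles?/"));           // false
    }
}
```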
"ef bf bc" is a 3 bytes UTF-8 character so as the error says, there's no representation for that character in "CP1252" Windows page encoding.
An option could be to replace that percent encoding sequence with an ascii representation to make the filename for saving:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/".replace("%ef%bf%bc", "-xEFxBFxBC"), "UTF-8");
url ==> "https://www.breightgroup.com/job/hse-advisor-emb ... contract-roles-xEFxBFxBC/"
Another option, using CharsetDecoder:
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
CharsetDecoder decoder = Charset.forName("CP1252").newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer buffer = ByteBuffer.wrap(urlDec.getBytes(Charset.forName("UTF-8")));
decoder.decode(buffer).toString();
Result
"https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/"
I resolved the issue by simply skipping URLs that contain this symbol, because there are other URLs with invisible Unicode symbols that couldn't be converted either.
So I just match each URL against the following regex, and if it returns false, I bypass that URL. Hope this helps someone out:
boolean newURL = url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");
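A small sketch of how that check behaves, wrapped in a hypothetical helper (the example.com URLs are made up for illustration). One thing it suggests: the pattern allows "%" and hex digits, so a URL that is still percent-encoded passes, and the check only catches the symbol after decoding.

```java
class UrlFilterDemo {
    // Same character whitelist as the pattern above; a URL passes only if every character is listed.
    public static boolean isPlainAscii(String url) {
        return url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");
    }

    public static void main(String[] args) {
        System.out.println(isPlainAscii("https://example.com/job/roles/"));          // true
        // Decoded form containing U+FFFC is rejected.
        System.out.println(isPlainAscii("https://example.com/job/roles\ufffc/"));    // false
        // Still percent-encoded: every character is whitelisted, so it passes.
        System.out.println(isPlainAscii("https://example.com/job/roles%ef%bf%bc/")); // true
    }
}
```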

How to print the symbols &nbsp; and &lt; and &gt; to a file as space < and > respectively

How do you print symbols in Java to a file when you have only the symbol description?
I received a string from DB2 which contains symbols.
Two samples:
1) &lt;0800&gt;
2) 51V&nbsp;3801Z
Such a string goes to two different places. One is a JSP rendering it as HTML. That works perfectly; I get <0800> and 51V 3801Z, respectively. The other place is a CSV file created with java.io.FileWriter, and there the entities are not converted to "<", ">", and " ". Instead, the string is printed exactly as it came from DB2:
&lt;0800&gt;
and 51V&nbsp;3801Z.
Is there anything in the "new" NIO library that could help me? I have tried org.apache.commons.lang3.StringEscapeUtils.escapeHtml4 without success.
I suggest looking into Apache's StringEscapeUtils, namely the unescapeHtml4() method.
Example:
String input = "&lt;0800&gt;";
String output = StringEscapeUtils.unescapeHtml4(input);
Ensure you are using the unescapeHtml4 method, and not the regular escapeHtml4 method!
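If adding the Commons dependency is not an option, a hand-rolled fallback might look like the sketch below. It is not the Commons API; it assumes the DB2 strings contain only the entities &lt;, &gt;, &nbsp; and &amp;, and it maps &nbsp; to an ordinary space because that is what the question asks for. The class and method names are hypothetical.

```java
class EntityUnescapeDemo {
    // Replaces a small, fixed set of HTML entities; anything else is left untouched.
    public static String unescapeBasic(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&nbsp;", " ")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and is not unescaped twice
    }

    public static void main(String[] args) {
        System.out.println(unescapeBasic("&lt;0800&gt;"));   // <0800>
        System.out.println(unescapeBasic("51V&nbsp;3801Z")); // 51V 3801Z
    }
}
```

The unescaped string can then be passed to the existing FileWriter unchanged.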

Why is my UTF-8 encoded data not staying "UTF-8" encoded?

The problem I'm trying to fix is this:
Users of our application are copy/pasting characters from windows-related docs like Word for instance, and our application is not recognizing single and double quotes or bullets.
These are the steps I've taken so far to get this data into UTF format:
Inside server.xml, in the Connector tag, I added the attribute URIEncoding="UTF-8".
In the bean charged with storing the input, I created a byte[] from the String holding the inputNote text, encoding it as UTF-8, then built a new String from those UTF-8 bytes and assigned it back to the inputNoteText String. Please see directly below for condensed code on this.
byte[] bytesInUTF8inputNoteText = inputNoteText.getBytes("UTF-8");
inputNoteText = new String(bytesInUTF8inputNoteText, "UTF-8");
this.var = inputNoteText;
In the setter charged with holding the result from the DB query, setNoteText(noteText), I convert the note data coming from the database query into bytes in UTF-8 format, then convert it back into a String and assign it to the String noteText property. Also below.
public void setNoteText(String noteText) throws UnsupportedEncodingException {
byte[] bytesInUTF8inputNoteText = noteText.getBytes("UTF-8");
String noteTextUTF8 = new String(bytesInUTF8inputNoteText, "UTF-8");
this.noteText = noteTextUTF8;}
In SQL Server I changed the data type from text to nvarchar(MAX) to store the data in Unicode, even though that is a different type of Unicode.
What I see when I copy/paste from a MS Word doc into our JSF input textbox:
In Eclipse, if I set a watch on the property in the bean, once the data in that String property has been converted into UTF-8, all characters are in UTF-8 format. When I post to SQL Server, the string of data held in the nvarchar(max) datatype shows all characters in UTF-8 format correctly. Then, when the ResultSet is returned and the holding property is populated with the String returned from the DB query, it also shows as all being correctly formatted in UTF-8. But somewhere between the correct string value sitting in the property that is tied to the JSF page and the JSF page itself (JSF 1.2, by the way), the value is being mangled, so that I see question marks where I should see single/double quotes and bullet points. I hope that someone has run into this type of issue before and can shed some light on what I need to do to fix this. It seems kind of like a JSF bug. Thanks in advance for your input!
Try this:
noteText = new String(noteText.getBytes("ISO-8859-1"), "UTF-8");
When you copy and paste from Windows documents, the encoding is not UTF-8 but Windows-1252 (http://en.wikipedia.org/wiki/Windows-1252). Note the cells marked with thick green borders in the table there. These characters are encoded differently in Windows-1252 than in UTF-8, so you will have to use the Windows-1252 encoding while reading.
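The following sketch (hypothetical class and method names, JDK only) illustrates two points from this thread under the assumption that the pasted text contains Word-style curly quotes: encoding a String to UTF-8 and decoding it straight back changes nothing, and question marks appear as soon as the text passes through a charset such as ISO-8859-1 that has no curly quotes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SmartQuoteDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encodes and immediately decodes with the same charset.
    public static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String quoted = "\u201chello\u201d"; // curly double quotes, as pasted from Word

        // UTF-8 and Windows-1252 can both represent the quotes, so the round trip is a no-op.
        System.out.println(roundTrip(quoted, StandardCharsets.UTF_8).equals(quoted)); // true
        System.out.println(roundTrip(quoted, CP1252).equals(quoted));                 // true
        // ISO-8859-1 cannot, so each quote is replaced by '?' during encoding.
        System.out.println(roundTrip(quoted, StandardCharsets.ISO_8859_1));           // ?hello?
    }
}
```

This suggests looking for the layer between the bean and the rendered page that encodes the response with a Latin-1 style charset, rather than re-encoding the String inside the bean.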

why is String.split("£", 2) not working?

I have a text file with 1000 lines in the following format:
19 x 75 Bullnose Architrave/Skirting £1.02
I am writing a method that reads the file in line by line. This works OK.
I then want to split each string using "£" as a delimiter and write it out to
an ArrayList<String> in the following format:
19 x 75 Bullnose Architrave/Skirting, Metre, 1.02
This is how I have approached it (productList is the ArrayList, declared/instantiated outside the try block):
try{
br = new BufferedReader(new FileReader(aFile));
String inputLine = br.readLine();
String delim = "£";
while (inputLine != null){
String[]halved = inputLine.split(delim, 2);
String lineOut = halved[0] + ", Metre, " + halved[1];//Array out of bounds
productList.add(lineOut);
inputLine = br.readLine();
}
}
The String is not splitting and I keep getting an ArrayIndexOutOfBoundsException. I'm not very familiar with regex. I've also tried using the old StringTokenizer but get the same result.
Is there an issue with £ as a delim or is it something else? I did wonder if it is something to do with the second token not being read as a String?
Any ideas would be helpful.
Here are some of the possible causes:
The encoding of the file doesn't match the encoding that you are using to read it, and the "pound" character in the file is getting "mangled" into something else.
The file and your source code are using different pound-like characters. For instance, Unicode has two code points that look like a "pound sign": the Pound Sterling character (U+00A3) and the Lira character (U+20A4) ... then there is the Roman semuncia character (U+10192).
You are trying to compile a UTF-8 encoded source file without telling the compiler that it is UTF-8 encoded.
Judging from your comments, this is an encoding mismatch problem; i.e. the "default" encoding being used by Java doesn't match the actual encoding of the file. There are two ways to address this:
Change the encoding of the file to match Java's default encoding. You seem to have tried that and failed. (And it wouldn't be the way I'd do this ...)
Change the program to open the file with a specific (non default) encoding; e.g. change
new FileReader(aFile)
to
new InputStreamReader(new FileInputStream(aFile), encoding)
where encoding is the name of the file's actual character encoding. The names of the encodings understood by Java are listed here, but my guess is that it is "ISO-8859-1" (aka Latin-1).
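A self-contained sketch of the second option follows. It assumes the file really is ISO-8859-1, as guessed above, uses Files.newBufferedReader instead of the reader in the question, and writes the pound sign as \u00a3 so the result does not depend on the source file's own encoding. The class name, the temporary file, and the trim() call are additions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class PriceListReader {
    // Reads every line with the given charset and splits it at the first pound sign (U+00A3).
    public static List<String> read(Path file, Charset charset) throws IOException {
        List<String> productList = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] halved = line.split("\u00a3", 2);
                if (halved.length == 2) { // skip lines without a pound sign instead of failing
                    productList.add(halved[0].trim() + ", Metre, " + halved[1]);
                }
            }
        }
        return productList;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("prices", ".txt");
        Files.write(file, "19 x 75 Bullnose Architrave/Skirting \u00a31.02\n"
                .getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(read(file, StandardCharsets.ISO_8859_1));
        // prints [19 x 75 Bullnose Architrave/Skirting, Metre, 1.02]
        Files.delete(file);
    }
}
```

Unlike FileReader, Files.newBufferedReader throws a MalformedInputException when the bytes do not fit the charset, which makes a mismatch visible immediately.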
This is probably a case of encoding mismatch. To check for this,
Print delim.length() and make sure it is 1.
Print inputLine.length() and make sure it is the right value (42).
If one of them is not the expected value, then you have to make sure you are using UTF-8 everywhere.
You say delim.length() is 1, so this is good. On the other hand, if inputLine.length() is 34, this is very wrong. For "19 x 75 Bullnose Architrave/Skirting £1.02" you should get 42 if all was as expected. If your file was UTF-8 encoded but read as ISO-8859-1 or similar, you would have gotten 43.
Now I am a little at a loss. To debug this you could print individually each character of the string and check what is wrong with them.
for (int i = 0; i < inputLine.length(); i++) {
    System.err.println("debug: " + i + ": " + inputLine.charAt(i) + " (" + inputLine.codePointAt(i) + ")");
}
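The length arithmetic above can be reproduced in memory. This is a sketch with hypothetical names; it simulates "file written in one charset, read in another" by encoding and decoding a String rather than touching a real file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LengthCheckDemo {
    // Simulates writing text in one charset and reading it back in another.
    public static String reread(String s, Charset written, Charset readAs) {
        return new String(s.getBytes(written), readAs);
    }

    public static void main(String[] args) {
        String line = "19 x 75 Bullnose Architrave/Skirting \u00a31.02";
        System.out.println(line.length()); // 42

        // UTF-8 file read as ISO-8859-1: the pound sign's two bytes become two characters.
        System.out.println(reread(line, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length()); // 43

        // ISO-8859-1 file read as UTF-8: the lone 0xA3 byte is invalid and is replaced by U+FFFD,
        // so the pound sign can no longer be found.
        System.out.println(reread(line, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8).indexOf('\u00a3')); // -1
    }
}
```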
Many thanks for all your replies.
Specifying the encoding when reading and saving the original text file as UTF-8 has worked.
However, the experience has taught me that delimiting text using "£", or indeed other characters that may have multiple representations in different encodings, is a poor strategy.
I have decided to take a different approach:
1) Find the last space in the input string & replace it with "xxx" or similar.
2) Split this using the delimiter "xxx." which should split the strings & rip out the "£".
3) Carry on..
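A variation on these steps, sketched with lastIndexOf and substring instead of the "xxx" placeholder; the class and method names are hypothetical. Like the "xxx." pattern, it assumes the currency symbol arrives as exactly one character, so it would leave a stray character behind if the symbol had been mangled into two.

```java
class LastSpaceSplitDemo {
    // Splits at the last space and drops the single currency character that follows it.
    public static String toCsv(String line) {
        int lastSpace = line.lastIndexOf(' ');
        if (lastSpace < 0 || lastSpace + 2 > line.length()) {
            throw new IllegalArgumentException("No price found in: " + line);
        }
        String description = line.substring(0, lastSpace);
        String price = line.substring(lastSpace + 2); // skip the space and the currency symbol
        return description + ", Metre, " + price;
    }

    public static void main(String[] args) {
        System.out.println(toCsv("19 x 75 Bullnose Architrave/Skirting \u00a31.02"));
        // prints 19 x 75 Bullnose Architrave/Skirting, Metre, 1.02
    }
}
```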

Converting from Java String to Windows-1252 Format

I want to send a URL request, but the parameter values in the URL can contain French characters (e.g. è). How do I convert from a Java String to Windows-1252 format (which supports the French characters)?
I am currently doing this:
String encodedURL = new String (unencodedUrl.getBytes("UTF-8"), "Windows-1252");
However, it makes:
param=Stationnement extèrieur into param=Stationnement extérieur .
How do I fix this? Any suggestions?
Edit for further clarification:
The user chooses values from a drop down. When the language is French, the values from the drop down sometimes include French characters, like 'è'. When I send this request to the server, it fails, saying it is unable to decipher the request. I have to figure out how to send the 'è' as a different format (preferably Windows-1252) that supports French characters. I have chosen to send as Windows-1252. The server will accept this format. I don't want to replace each character, because I could miss a special character, and then the server will throw an exception.
Use URLEncoder to encode parameter values as application/x-www-form-urlencoded data:
String param = "param="
+ URLEncoder.encode("Stationnement ext\u00e8rieur", "cp1252");
See here for an expanded explanation.
Try using
String encodedURL = new String (unencodedUrl.getBytes("UTF-8"), Charset.forName("Windows-1252"));
As per McDowell's suggestion, I tried encoding with:
URLEncoder.encode("stringValueWithFrenchCharacters", "cp1252"), but it didn't work perfectly. I replaced "cp1252" with HTTP.ISO_8859_1 because I believe Android does not support Windows-1252 yet. It does allow ISO_8859_1, and after reading here, this supports MOST of the French characters, with the exception of 'Œ', 'œ', and 'Ÿ'.
So doing this made it work:
URLEncoder.encode(frenchString, HTTP.ISO_8859_1);
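For comparison, a plain-JDK sketch of what URLEncoder produces for the value from the question under different charsets. The class and method names are hypothetical, and the charset is passed by name ("ISO-8859-1") rather than through the HTTP.ISO_8859_1 constant used above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FrenchParamDemo {
    // Builds "param=<value>" with the value percent-encoded in the named charset.
    public static String param(String value, String charsetName) throws UnsupportedEncodingException {
        return "param=" + URLEncoder.encode(value, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "Stationnement ext\u00e8rieur";

        // In ISO-8859-1, "è" is the single byte 0xE8.
        System.out.println(param(value, "ISO-8859-1")); // param=Stationnement+ext%E8rieur
        // In UTF-8 the same character needs two bytes.
        System.out.println(param(value, "UTF-8"));      // param=Stationnement+ext%C3%A8rieur
    }
}
```

The server has to decode with the same charset the client used for encoding, whichever one that is.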
Works perfectly!
