ISO-8859-1 to UTF-8 in Java (runescape API)

I am trying to make a Discord bot that gets information from the RuneScape API and returns information about the user. The issue I have is when a username contains a space.
The RuneScape API returns a file in ISO-8859-1, and I try to convert it to UTF-8.
Two example names from the file: lil Jimmy and lil jessica.
The loop finds a match for jessica, but not for Jimmy.
The code for getting and reading the file:
InputStream input = null;
InputStreamReader inputReader = null;
BufferedReader reader = null;
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
input = url.openConnection().getInputStream();
inputReader = new InputStreamReader(input, "ISO-8859-1");
reader = new BufferedReader(inputReader);
String line;
while ((line = reader.readLine()) != null) {
    String[] parts = line.split(",");
    parts[0] = new String(parts[0].getBytes("UTF-8"), "ISO-8859-1");
    if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
    if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
}
Does anyone know what I'm doing wrong? Thank you in advance for taking the time to help!
Edit 1: I've added the "ISO-8859-1" to inputReader as suggested in the answers. The next step is to replace the non-breaking spaces with regular spaces.
Edit 2: The non-breaking space can be handled with:
parts[0] = parts[0].replaceAll("\u00a0","aaaaaaaaa");
parts[0] = parts[0].replaceAll("\u00C2","bbbbbbbbb");
parts[0] = parts[0].replaceAll("bbbbbbbbbaaaaaaaaa", " ");
The aaaaaaaaa placeholder stands in for the non-breaking space and the bbbbbbbbb placeholder for the Â (U+00C2) that appears in front of it; the last line then replaces the pair with a regular space.
Thanks everyone for helping me out!

If you want to ensure that you're reading the data correctly, use:
inputReader = new InputStreamReader(input, "ISO-8859-1");
After that, I'm not sure why you're trying to convert to UTF-8, since you're just using the text as Strings from that point on. A string itself doesn't have an encoding. (Well, in a certain sense a Java string is like UTF-16 in its internal representation, but that's a whole other can of worms you don't need to worry about here.)
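To illustrate the point about decoding only once, here is a minimal sketch. The class name DecodeOnce and the hard-coded sample bytes are made up for the example; only the byte value 0xA0 (the non-breaking space in ISO-8859-1) comes from the discussion in this thread.

```java
import java.nio.charset.StandardCharsets;

public class DecodeOnce {
    // Decode raw ISO-8859-1 bytes into a String exactly once.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // The round trip from the question: encode as UTF-8, decode as ISO-8859-1.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "lil", then byte 0xA0, then "Jimmy".
        byte[] raw = {'l', 'i', 'l', (byte) 0xA0, 'J', 'i', 'm', 'm', 'y'};
        String name = decodeLatin1(raw);
        System.out.println(name.length());            // 9
        System.out.println((int) name.charAt(3));     // 160, i.e. U+00A0

        // The extra conversion turns U+00A0 into two chars: U+00C2 followed by U+00A0.
        String broken = roundTrip(name);
        System.out.println(broken.length());          // 10
        System.out.println((int) broken.charAt(3));   // 194, i.e. U+00C2
    }
}
```

Once the reader is given the right charset, the String is already correct and no further conversion is needed.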

First, you are not providing the charset to your InputStreamReader, which causes it to use the default charset instead of the one it should be using. You are then doing crazy stuff to try to fix it that you shouldn't have to do and that won't work properly.
Also, you are not closing the opened stream; you should be using try-with-resources.
It should probably look more like this:
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
// Decode as ISO-8859-1 explicitly; try-with-resources closes the stream.
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream(), StandardCharsets.ISO_8859_1))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split(",");
        if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
        if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
    }
}

Looking at the downloaded text file:
The whitespace for "lil jessica" is a regular space (U+0020), the one for "lil Jimmy" (and most of the others as well) is a non-breaking space (U+00A0).
If you don't care whether the space is breaking or non-breaking, the easiest approach is probably to replace it with a regular space in your input string. Something like:
parts[0] = parts[0].replaceAll("\u00a0"," ");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
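As a self-contained sketch of that replacement: the class name NameMatcher, the helper names and the sample line are made up for illustration, and the fields after the name are an assumption about the file layout rather than something confirmed here.

```java
public class NameMatcher {
    // Replace every non-breaking space (U+00A0) with a regular space (U+0020).
    static String normalizeSpaces(String name) {
        return name.replace('\u00a0', ' ');
    }

    // Take the first comma-separated field of a line and normalize its spaces.
    static String firstField(String csvLine) {
        return normalizeSpaces(csvLine.split(",")[0]);
    }

    public static void main(String[] args) {
        // Hypothetical line; only the name part is taken from the question.
        String line = "lil\u00a0Jimmy,Recruit,0,0";
        if (firstField(line).equals("lil Jimmy")) {
            System.out.println("lil Jimmy found");
        }
    }
}
```

String.replace with char arguments avoids the regex machinery of replaceAll, which is not needed for a single-character swap.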

Related

Android's BreakIterator considers line breaks as sentence delimiters

I have a unix text file that I want to read in my Android app and split it into sentences. However I noticed that BreakIterator considers some line break characters as sentence delimiters.
I use the following code to read the file and split it into sentences (only the first sentence is output for presentation purposes):
File file = new File...
String text = "";
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
try {
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
String line;
StringBuilder stringBuilder = new StringBuilder();
while ((line = bufferedReader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append('\n');
}
inputStream.close();
text = stringBuilder.toString();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
sentenceIterator.setText(text);
int end = sentenceIterator.next();
System.out.println(end);
System.out.println(text.substring(0, end));
But if I compile and run the code from Eclipse as a Desktop app the text is split correctly. I don't understand why it doesn't do the same on Android app.
I tried to convert the text file to dos format, I even tried to read the file and preserve original line breaks:
Pattern pat = Pattern.compile(".*\\R|.+\\z");
StringBuilder stringBuilder = new StringBuilder();
try (Scanner in = new Scanner(file, "UTF-8")) {
String line;
while ((line = in.findWithinHorizon(pat, 0)) != null) {
stringBuilder.append(line);
}
text = stringBuilder.toString();
sentenceIterator.setText(text);
int end = sentenceIterator.next();
System.out.println(end);
System.out.println(text.substring(0, end));
}
but without success. Any ideas?
You can download an excerpt from the file (unix format) here: http://dropmefiles.com/TZgBp
I've just noticed that it can be reproduced without download of this file. Just create a string that has line breaks inside sentences (e.g. "Hello, \nworld!") and run an instrumented test. If BreakIterator is used in a usual test then it splits correctly.
I expect 2 sentences:
sentence 1:
Foreword
IF a colleague were to say to you, Spouse of me this night today
manufactures the unusual meal in a home.
sentence 2:
You will join?
Yes, they don't look great but at least you know why it is so (sentence delimiters are ?. etc.). But if the code runs on Android it creates a sentence even from
Foreword
for some reason...
I'm not sure whether it is a bug, or whether there is a workaround for this. But in my eyes it makes the Android version of BreakIterator useless as a sentence splitter, as it is normal for sentences in books to spread over multiple lines.
In all the experiments I've used the same import java.text.BreakIterator;

This is not really an answer but it might give you some insights.
It is not a file encoding issue. I tried it this way and got the same faulty behaviour:
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
String text = "Foreword\nIf a colleague were to say to you, Spouse of me this night today manufactures the unusual meal in a home. You will join?";
sentenceIterator.setText(text);
Android does not use the same Java implementation as your computer.
I noticed this when I printed out the class of the sentenceIterator object:
sentenceIterator.getClass()
I have different classes when running with IntelliJ and when running on Android:
Running with IntelliJ:
sun.util.locale.provider.RuleBasedBreakIterator
Running on Android:
java.text.RuleBasedBreakIterator
sun.util.locale.provider.RuleBasedBreakIterator has the behaviour you want.
I don't know how to get Android to use the good RuleBasedBreakIterator class. I don't even know if it is possible.
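One possible workaround, offered as an untested assumption rather than a confirmed fix, is to collapse line breaks into single spaces before handing the text to BreakIterator, so that no line break is left for the iterator to react to. The class and method names below are made up, and the sketch discards paragraph breaks as well, which may not be acceptable for every text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UnwrapSentences {
    // Collapse each run of line breaks (plus surrounding whitespace) into one space.
    static String unwrapLines(String text) {
        return text.replaceAll("\\s*[\\r\\n]+\\s*", " ");
    }

    // Split text into sentences using the platform's BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Foreword\nIf a colleague were to say to you, Spouse of me this night today\n"
                + "manufactures the unusual meal in a home. You will join?";
        for (String sentence : sentences(unwrapLines(text))) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```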

Java code reads UTF-8 text incorrectly

I'm having a problem reading UTF-8 characters in my code (running on Eclipse).
I have a file text which has a few lines in it, for example:
אך 1234
NOTE: There is a \t before the word, and the word should appear on the left, the number on the right... I don't know how to reverse them here, sorry.
That is, a Hebrew word and then a number.
I need to separate the word from the number somehow. I tried this:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
String delims = "[ ]+";
String[] tokens = content.split(delims);
}
The problem is that for some reason, the code reads content (the first line in the file) as follows:
אך\t1234
...meaning that the space isn't in its correct place.
I suppose I could tokenize the text using the \t, but I'm not sure I should do it, as the file isn't being read correctly...
Does anyone have any idea why this happens?
Thanks so much :-)

I think you are matching a space when there actually is a tab there?
Can you try this:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
String delims = "\\s";
String[] tokens = content.split(delims);
}
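A small sketch of the whitespace split, with the sample line from the question written as Unicode escapes so the source file's own encoding does not matter. The class name Tokens and the in-memory input are made up for the example. Passing the charset explicitly, instead of relying on FileReader's platform default, is an extra precaution and not something the answer above states.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Tokens {
    // Trim leading/trailing whitespace, then split on runs of spaces or tabs.
    static String[] splitFields(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file: tab, a Hebrew word, tab, a number, encoded as UTF-8.
        byte[] fileBytes = "\t\u05d0\u05da\t1234\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8))) {
            String content;
            while ((content = br.readLine()) != null) {
                String[] tokens = splitFields(content);
                System.out.println(tokens.length + " tokens: " + Arrays.toString(tokens));
            }
        }
    }
}
```

Using "\\s+" rather than "\\s" keeps consecutive tabs or spaces from producing empty tokens, and the trim() call removes the leading tab before splitting.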

Reading file to String in Java results in invisible characters

I'm having trouble reading from a text file into a String in Java. I have a text file (created in Eclipse, if that matters) that contains a short amount of text, approximately 98 characters. Reading that file to a String via several methods results in a String that is quite a bit longer: 1621 characters. All but the relevant 98 are invisible in the debugger/console.
I've tried the following methods to load the String:
apache commons-io:
FileUtils.readFileToString(new File(path));
FileUtils.readFileToString(new File(path), "UTF-8");
byte[] b = FileUtils.readFileToByteArray(new File(path));
new String(b, "UTF-8");
byte[] b = FileUtils.readFileToByteArray(new File(path));
Charset.defaultCharset().decode(ByteBuffer.wrap(b)).toString();
NIO:
new String(Files.readAllBytes(path));
And so on.
Is there a method to strip away these control chars? Is there a way to read files to strings that doesn't have this issue?
As noted in the comments below, this behavior is due to a corrupted(?) file generated by Eclipse. I'd still be interested in hearing any strategies for trimming away control characters from Strings, though!

If you want to strip out all non-printable characters, try this
str = str.replaceAll("[^\\p{Graph}\n\r\t ]", "");
The regex matches all "invisible" characters, except ones we want to keep; in this case newline chars, tabs and spaces.
\p{Graph} is a POSIX character class for all printable/visible characters. To negate a POSIX character class, we can use a capital P, i.e. \P{Graph} (all non-printable/invisible characters); however, we must not strip newlines etc., so we use the negated set [^\\p{Graph}\n\r\t ] instead.
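Here is the same regex wrapped in a small runnable sketch; the class name StripInvisible and the sample strings are made up. One caveat worth knowing: in java.util.regex the POSIX classes are US-ASCII only by default, so this pattern also removes non-ASCII letters such as é unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS.

```java
public class StripInvisible {
    // Remove everything that is neither a visible ASCII character
    // nor a newline, carriage return, tab or space.
    static String strip(String s) {
        return s.replaceAll("[^\\p{Graph}\n\r\t ]", "");
    }

    public static void main(String[] args) {
        // NUL and BEL characters mixed into otherwise ordinary text.
        String dirty = "ab\u0000\u0000c\td\u0007\n";
        System.out.println(strip(dirty).equals("abc\td\n")); // true

        // Caveat: non-ASCII letters are stripped too with the default flags.
        System.out.println(strip("caf\u00e9"));              // caf
    }
}
```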

Read it line by line into a StringBuilder, and then convert it to a String:
StringBuilder sb = new StringBuilder();
BufferedReader file = new BufferedReader(new FileReader(fileName));
while (true)
{
String line = file.readLine();
if (line == null)
break;
sb.append(line+"\n");
}
file.close();
return sb.toString();

How to split a very long string

I have a big file (about 30 MB), and here is the code I use to read data from the file:
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(new FileReader(file));
try {
    String line = br.readLine();
    while (line != null) {
        sb.append(line).append("\n");
        line = br.readLine();
    }
} finally {
    br.close();
}
Then I need to split the content I read, so I use
String[] inst = sb.toString().split("GO");
The problem is that sometimes the sub-string is over the maximum String length and I can't get all the data inside the string. How can I get rid of this?
Thanks

Use a Scanner with "GO" as the delimiter and read the pieces one at a time with s.next():
Scanner s = new Scanner(input).useDelimiter("GO");
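A minimal sketch of the Scanner approach, reading from an in-memory StringReader in place of the 30 MB file; the class and method names are made up. Note that useDelimiter takes a regular expression, and a plain "GO" also matches inside longer words such as "GOTO", so a stricter pattern may be needed for real data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class GoSplitter {
    // Read tokens separated by the literal text "GO" without first
    // building one giant String of the whole input.
    static List<String> splitOnGo(Readable source) {
        List<String> parts = new ArrayList<>();
        try (Scanner s = new Scanner(source).useDelimiter("GO")) {
            while (s.hasNext()) {
                parts.add(s.next());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for the file contents.
        String input = "SELECT 1\nGO\nSELECT 2\nGO\nSELECT 3\n";
        for (String part : splitOnGo(new StringReader(input))) {
            System.out.println("[" + part.trim() + "]");
        }
    }
}
```

For a real file, a FileReader or a Scanner constructed directly on the File can replace the StringReader; each piece can then be processed and discarded before the next one is read.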

Why it happens: the erroneous result may be caused by a non-contiguous heap, as the CMS collector does not defragment memory.
(This does not answer how to solve it, though.)
You may opt to load the string piece by piece, i.e. using substring.

java read properties and xml file using stringbuilder

I need to read a set of XML and property files and parse the data. Currently I am using an InputStream and a StringBuilder to do this, but the result does not match the layout of the input file. I do not want to remove the white space and new lines. How do I achieve this?
is = test.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
String line5;
StringBuilder sb5 = new StringBuilder();
while ((line5 = br.readLine()) != null) {
sb5.append(line5);
}
String s = sb5.toString();
My output is:
#test 123 #test2 345
Expected output is:
#test
123
#test2
345
Any thoughts? Thanks

br.readLine() consumes the line breaks, so you need to add them to your StringBuilder after appending each line.
is = test.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
String line5;
StringBuilder sb5 = new StringBuilder();
while ((line5 = br.readLine()) != null) {
sb5.append(line5);
sb5.append("\n");
}
If you want an extremely simple solution for reading a file to a String, Apache Commons-IO has a method for performing such a task (org.apache.commons.io.FileUtils).
FileUtils.readFileToString(File file, String encoding);

The readLine() method doesn't include the EOL character (\n) in the string it returns. So when appending the string to the builder, you need to add the EOL char yourself, like sb5.append(line5+"\n");

The various readLine methods discard the newline from the input.
From the BufferedReader docs:
Returns: A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached
A solution may be as simple as adding back a newline to your StringBuilder for every readLine: sb5.append(line5 + "\n");.
A better alternative is to read into an intermediate buffer first, using the read method and supplying your own char[]. You can still use StringBuilder.append, and you get a String that matches the file contents.
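The buffer-based read described above could look roughly like this. The class name, the method name and the 8192-char buffer size are arbitrary choices for the sketch, and a StringReader stands in for the real input stream.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadAll {
    // Copy everything from the reader into a String, line terminators included.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n); // append only the chars actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "#test\r\n123\n#test2\n345";
        String copy = readAll(new StringReader(original));
        System.out.println(copy.equals(original)); // true
    }
}
```

Unlike the readLine() loop, this keeps \r\n and \n exactly as they appear in the input and does not add a newline after the last line.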
