I'm having a problem reading UTF-8 characters in my code (running on Eclipse).
I have a text file with a few lines in it, for example:
אך 1234
NOTE: There is a \t before the word, and the word should appear on the left, the number on the right... I don't know how to reverse them here, sorry.
That is, a Hebrew word and then a number.
I need to separate the word from the number somehow. I tried this:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
String delims = "[ ]+";
String[] tokens = content.split(delims);
}
The problem is that for some reason, the code reads content (the first line in the file) as follows:
אך\t1234
...meaning that the space isn't in its correct place.
I suppose I could tokenize the text using the \t, but I'm not sure I should do it, as the file isn't being read correctly...
Does anyone have any idea why this happens?
Thanks so much :-)
I think you are trying to match a space when there is actually a tab there.
Can you try this:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
String delims = "\\s";
String[] tokens = content.split(delims);
}
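If it helps, here is a minimal, self-contained sketch of that idea. The file name and the UTF-8 charset are assumptions on my part (you mentioned the file contains Hebrew text, so the platform default charset may not be what you want), and split("\\s+") treats a tab the same as a space:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        // "words.txt" is a made-up name; point it at your own file.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("words.txt"), StandardCharsets.UTF_8))) {
            String content;
            while ((content = br.readLine()) != null) {
                // trim() drops a leading tab/space so the first token isn't empty;
                // "\\s+" matches one or more whitespace characters (space or tab).
                String[] tokens = content.trim().split("\\s+");
                if (tokens.length == 2) {
                    String word = tokens[0];   // the Hebrew word
                    String number = tokens[1]; // e.g. "1234"
                    System.out.println(word + " -> " + number);
                }
            }
        }
    }
}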
Related
I am trying to make a Discord bot which gets information from the RuneScape API and returns information about the user. The issue I have is when a username has a space in it.
The RuneScape API gives a file in ISO-8859-1 and I try to convert it to UTF-8.
2 examples from the file: lil Jimmy and lil jessica.
The loop finds a match for jessica, but not for jimmy.
The code for getting and reading the file:
InputStream input = null;
InputStreamReader inputReader = null;
BufferedReader reader = null;
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
input = url.openConnection().getInputStream();
inputReader = new InputStreamReader(input, "ISO-8859-1");
reader = new BufferedReader(inputReader);
String line;
while ((line = reader.readLine()) != null) {
String[] parts = line.split(",");
parts[0] = new String(parts[0].getBytes("UTF-8"), "ISO-8859-1");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
Does anyone know what I'm doing wrong? Thank you in advance for taking the time to help!
Edit 1: I've added the "ISO-8859-1" to inputReader as suggested by the answers. Now the next step is to replace the non-breaking white space with regular white spaces.
Edit 2: The non breaking whitespace can be solved by:
parts[0] = parts[0].replaceAll("\u00a0","aaaaaaaaa");
parts[0] = parts[0].replaceAll("\u00C2","bbbbbbbbb");
parts[0] = parts[0].replaceAll("bbbbbbbbbaaaaaaaaa", " ");
The first replaceAll marks the non-breaking space (U+00A0) with a placeholder, the second marks the Â (U+00C2) that shows up in front of it, and the third replaces the combined marker with a single regular space.
Thanks everyone for helping me out!
If you want to ensure that you're reading the data correctly, use:
inputReader = new InputStreamReader(input, "ISO-8859-1");
After that, I'm not sure why you're trying to convert to UTF-8, since you're just using the text as Strings from that point on. A string itself doesn't have an encoding. (Well, in a certain sense a Java string is like UTF-16 in its internal representation, but that's a whole other can of worms you don't need to worry about here.)
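To make that concrete, a small sketch (variable names follow your snippet; StandardCharsets is java.nio.charset.StandardCharsets):
// Decode the bytes exactly once, with the charset the server actually sends.
inputReader = new InputStreamReader(input, StandardCharsets.ISO_8859_1);
reader = new BufferedReader(inputReader);

String line = reader.readLine();
// 'line' is now just a sequence of characters; there is nothing to "convert to UTF-8"
// while it stays a String. An encoding only matters again when you write bytes out,
// e.g. with new OutputStreamWriter(someOutputStream, StandardCharsets.UTF_8).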
First, you are not providing the charset in your InputStreamReader, which causes it to use the default charset instead of the one it should be using; then you are doing workarounds to try to fix it that you shouldn't need and that won't work properly.
Also, you are not closing the opened stream; you should be using try-with-resources.
It should probably look more like this:
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream(), StandardCharsets.ISO_8859_1))) {
String line;
while ((line = reader.readLine()) != null) {
String[] parts = line.split(",");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
}
}
Looking at the downloaded text file:
The whitespace for "lil jessica" is a regular space (U+0020), the one for "lil Jimmy" (and most of the others as well) is a non-breaking space (U+00A0).
If you don't care for breaking or non-breaking, the easiest approach is probably to replace it with a regular white space in your input string. Something like:
parts[0] = new String(parts[0].getBytes("UTF-8"), "ISO-8859-1");
parts[0] = parts[0].replaceAll("\u00a0"," ");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
I've got an oddball problem here. I've got a little java program that filters Minecraft log files to make them easier to read. On each line of these logs, there are usually multiple instances of the character "§", which returns a hex value of FFFD.
I am filtering out this character (as well as the character following it) using:
currentLine = currentLine.replaceAll("\uFFFD.", "");
Now, when I run the program through NetBeans, it works swell. My lines get outputted looking like this:
CxndyAnnie: Mhm
CxndyAnnie: Sorry
But when I build the .jar file and wrap it into a .exe file using JSmooth, that character no longer gets filtered out when I run the .exe, and my lines come out looking like this:
§e§7[§f$65§7] §1§nCxndyAnnie§e: Mhm
§e§7[§f$65§7] §1§nCxndyAnnie§e: Sorry
(note: the additional square brackets and $65 show up because their filtering depends on the special character and its following character being removed first)
Any ideas why this would no longer work after putting it through JSmooth? Is there a different way to do the text replace that would preserve its function?
By the way, I also attempted to remove this character using
currentLine = currentLine.replaceAll("§.", "");
but that didn't work in NetBeans nor as a .exe.
I'll go ahead and paste the full method below:
public static String[] filterLines(String[] allLines, String filterType, Boolean timeStamps) throws IOException {
String currentLine = null;
FileWriter saveFile = new FileWriter("readable.txt");
String heading;
String string1 = "[L]";
String string2 = "[A]";
String string3 = "[G]";
if (filterType.equals(string1)) {
heading = "LOCAL CHAT LOGS ONLY \r\n\r\n";
}
else if (filterType.equals(string2)) {
heading = "ADVERTISING CHAT LOGS ONLY \r\n\r\n";
}
else if (filterType.equals(string3)) {
heading = "GLOBAL CHAT LOGS ONLY \r\n\r\n";
}
else {
heading = "CHAT LINES CONTAINING \"" + filterType + "\" \r\n\r\n";
}
saveFile.write(heading);
for (int i = 0; i < allLines.length; i++) {
if ((allLines[i] != null ) && (allLines[i].contains(filterType))) {
currentLine = allLines[i];
if (!timeStamps) {
currentLine = currentLine.replaceAll("\\[..:..:..\\].", "");
}
currentLine = currentLine.replaceAll("\\[Client thread/INFO\\]:.", "");
currentLine = currentLine.replaceAll("\\[CHAT\\].", "");
currentLine = currentLine.replaceAll("\uFFFD.", "");
currentLine = currentLine.replaceAll("\\[A\\].", "");
currentLine = currentLine.replaceAll("\\[L\\].", "");
currentLine = currentLine.replaceAll("\\[G\\].", "");
currentLine = currentLine.replaceAll("\\[\\$..\\].", "");
currentLine = currentLine.replaceAll(".>", ":");
currentLine = currentLine.replaceAll("\\[\\$100\\].", "");
saveFile.write(currentLine + "\r\n");
//System.out.println(currentLine);
}
}
saveFile.close();
ProcessBuilder openFile = new ProcessBuilder("Notepad.exe", "readable.txt");
openFile.start();
return allLines;
}
FINAL EDIT
Just in case anyone stumbles across this and needs to know what finally worked, here's the snippet of code where I pull the lines from the file and re-encode it to work:
BufferedReader fileLines;
fileLines = new BufferedReader(new FileReader(file));
String[] allLines = new String[numLines];
String line;
int i = 0;
while ((line = fileLines.readLine()) != null) {
byte[] bLine = line.getBytes();
String convLine = new String(bLine, Charset.forName("UTF-8"));
allLines[i] = convLine;
i++;
}
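If anyone wants a slightly tidier variant of that last snippet, the following sketch (assuming the log file really is UTF-8; file and numLines are the same variables as above) asks the reader for UTF-8 up front instead of re-encoding each line afterwards:
String[] allLines = new String[numLines];
int i = 0;
try (BufferedReader fileLines = Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
    String line;
    while ((line = fileLines.readLine()) != null) {
        allLines[i] = line; // already decoded as UTF-8, no getBytes() round-trip needed
        i++;
    }
}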
I also had a problem like this in the past with Minecraft logs. I don't remember the exact details, but the issue came down to a file-encoding problem, where UTF-8 encoding worked correctly but some other encodings, including the system default, did not.
First:
Make sure that you specify UTF-8 encoding when decoding the bytes read from the file, so that allLines contains the correct text, like so:
Path fileLocation = Paths.get("C:/myFileLocation/logs.txt");
byte[] data = Files.readAllBytes(fileLocation);
String allLines = new String(data , Charset.forName("UTF-8"));
Second:
Using \uFFFD is not going to work, because \uFFFD is only used to replace an incoming character whose value is unknown or unrepresentable in Unicode.
However, if you use the correct encoding (shown in my first point), then \uFFFD is not necessary, because the character § is known in Unicode, so you can simply use
currentLine.replaceAll("§", "");
or specifically use the Unicode escape for that character, U+00A7, like so:
currentLine.replaceAll("\u00A7", "");
or just use both those lines in your code.
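Put together, a minimal sketch of both points (reusing the placeholder path from above; Files, Paths and StandardCharsets are from java.nio) could look like this:
List<String> lines = Files.readAllLines(Paths.get("C:/myFileLocation/logs.txt"), StandardCharsets.UTF_8);
for (String currentLine : lines) {
    // Remove the section sign plus the single formatting code after it,
    // mirroring the original "\uFFFD." replacement.
    currentLine = currentLine.replaceAll("\u00A7.", "");
    System.out.println(currentLine);
}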
Basically I've got an assignment which reads multiple lines from a .txt file.
There are 4 values in the text file per line and each value is separated by 2 spaces.
There are about 10 lines of data in the file.
After taking the input from the file the program then puts it onto a Database. The database connection functionality works fine.
My issue now is with reading from the file using a BufferedReader.
The issue is that if I uncomment any one of the 3 lines at the bottom, the BufferedReader reads every other line. And if I don't use them, then there's an exception, as the next input is of type String.
I have contemplated using a Scanner with the .hasNextLine() method.
Any thoughts on what could be the problem and how to fix it?
Thanks.
File file = new File(FILE_INPUT_NAME);
FileReader fr = new FileReader(file);
BufferedReader readFile = new BufferedReader(fr);
String line = null;
while ((line = readFile.readLine()) != null) {
String[] split = line.split("  ", 4);
String id = split[0];
nameFromFile = split[1];
String year = split[2];
String mark = split[3];
idFromFile = Integer.parseInt(id);
yearOfStudyFromFile = Integer.parseInt(year);
markFromFile = Integer.parseInt(mark);
//line = readFile.readLine();
//readFile.readLine();
//System.out.println(readFile.readLine());
}
Edit: There was an error in the formatting of the .txt file: a missing value.
But now I get an ArrayIndexOutOfBoundsException.
Edit edit: Another error in the .txt file! Turns out there was a single space instead of a double. It seems to be working now. But any advice on how to deal with file errors like this in the future?
The issue is that if I uncomment any 1 of the 3 lines at the bottom the BufferedReader reads every other line.
Correct. If you put any of those lines of code in, the line of text they read will be thrown away and not processed. You're already reading in the while condition; you don't need another read.
A compilable version of the code posted could be
public void read() throws IOException {
File file = new File(FILE_INPUT_NAME);
FileReader fr = new FileReader(file);
BufferedReader readFile = new BufferedReader(fr);
String line;
while ((line = readFile.readLine()) != null) {
String[] split = line.split(" ", 4);
if (split.length != 4) { // Not enough tokens (e.g., empty line) read
continue;
}
String id = split[0];
String nameFromFile = split[1];
String year = split[2];
String mark = split[3];
int idFromFile = Integer.parseInt(id);
int yearOfStudyFromFile = Integer.parseInt(year);
int markFromFile = Integer.parseInt(mark);
//line = readFile.readLine();
//readFile.readLine();
//System.out.println(readFile.readLine());
}
}
The above uses a single space (" ") instead of the original two spaces ("  "). To split on any run of whitespace, a regular expression can be used, e.g. "\\s+". Of course, exactly 2 spaces can also be used, if that reflects the structure of the input data.
What the method should do with the extracted values (e.g., returning them in an object of some type, or saving them to a database directly), is up to the application using it.
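For illustration, a short sketch of the "\\s+" variant on a made-up sample line:
// Hypothetical sample line; "\\s+" tolerates one or more spaces (or a tab)
// between the four values, so a stray single space no longer breaks the parsing.
String sampleLine = "42  Alice  2019  87";
String[] split = sampleLine.split("\\s+", 4);
if (split.length == 4) {
    int id = Integer.parseInt(split[0]);    // 42
    String name = split[1];                 // "Alice"
    int year = Integer.parseInt(split[2]);  // 2019
    int mark = Integer.parseInt(split[3]);  // 87
}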
I have a big file (about 30 MB), and here is the code I use to read data from the file:
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(new FileReader(file));
try {
    String line = br.readLine();
    while (line != null) {
        sb.append(line).append("\n");
        line = br.readLine();
    }
} finally {
    br.close();
}
Then I need to split the content I read, so I use
String[] inst = sb.toString().split("GO");
The problem is that sometimes the sub-string is over the maximum String length and I can't get all the data into the string. How can I get around this?
Thanks
Use Scanner s = new Scanner(input).useDelimiter("GO"); and read each piece with s.next().
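Spelled out a little (reusing the file variable from the question; Scanner is java.util.Scanner and its File constructor throws FileNotFoundException):
// Stream the file and hand back one "GO"-delimited chunk at a time,
// so the whole 30 MB never has to sit in a single String.
Scanner s = new Scanner(file).useDelimiter("GO");
while (s.hasNext()) {
    String chunk = s.next(); // everything up to the next "GO"
    // process the chunk here
}
s.close();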
WHY PART: The erroneous result may be the outcome of a non-contiguous heap segment, as the CMS collector doesn't defragment memory.
(This does not answer the how-to-solve part, though.)
You may opt for processing the whole string part-wise, i.e. using substring.
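If you do go the substring route, a rough sketch of part-wise processing (the chunk size is an arbitrary choice, and sb is the StringBuilder from the question):
// Note: sb.toString() still materializes the full string once;
// this only avoids holding additional giant copies of it.
String content = sb.toString();
int chunkSize = 1 << 20; // 1M characters per chunk, arbitrary
for (int start = 0; start < content.length(); start += chunkSize) {
    int end = Math.min(start + chunkSize, content.length());
    String chunk = content.substring(start, end);
    // process the chunk here (e.g. look for "GO" boundaries)
}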
I've been looking through the Internet and, after a big headache, can't find why this regular expression is wrong:
"\"\w*&&[\p{Punct}]\"["+sepChar+"]\"\w*&&[\p{Punct}]\""
I'm trying to read a master data file with the following pattern (quotes included):
"TEXTVALUE":"TEXTVALUE":"TEXTVALUE"
and split each line with the regular expression above.
So, for example:
"Hello:John":"Hello:World":"Hello:Mark"
will be split into:
{"Hello:John", "Hello:World", "Hello:Mark"}
The backslash is the escape character in Java string literals. You need to use two backslashes (\\) to include a single backslash in the regex.
Try:
"\"\\w*&&[\\p{Punct}]\"["+sepChar+"]\"\\w*&&[\\p{Punct}]\""
Ok.
Thanks to #kevin-bowersox for the help.
It seems that Oracle has done a great job improving Java with version 7.
With this code:
File file = new File(someFile);
BufferedReader br = new BufferedReader(new FileReader(file));
String line = null;
while((line = br.readLine()) != null){
//todo
}
If your file has been formatted with a constant pattern, for example:
"TEXTVALUE":"TEXTVALUE":"TEXTVALUE"
It reads:
"TEXTVALUE-->TEXTVALUE-->TEXTVALUE"
where '-->' stands for tabs ('\t')
So, at the end, my solution is:
public ArrayList<String[]> getSplittedTextFromFile(String filePath) throws FileNotFoundException, IOException {
    ArrayList<String[]> ret = null;
    if (!filePath.isEmpty()) {
        File input = new File(filePath);
        BufferedReader br = new BufferedReader(new FileReader(input));
        String line = null;
        while ((line = br.readLine()) != null) {
            String[] aSplit = line.split("\\t");
            if (ret == null)
                ret = new ArrayList<>();
            ret.add(aSplit);
        }//while
        br.close();
    }//fi
    return ret;
}//fnc
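And a possible usage sketch for the method above (the file name is made up; Arrays is java.util.Arrays, and the checked exceptions still need to be handled or declared):
ArrayList<String[]> rows = getSplittedTextFromFile("masterdata.txt"); // hypothetical file name
if (rows != null) {
    for (String[] row : rows) {
        System.out.println(Arrays.toString(row)); // one line's tab-separated values
    }
}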