Java - regex match in String vs. readline in file

I don't understand this strange behaviour of regex matching in Java. I'm working in Eclipse.
I have a .txt file encoded in UTF-8 that contains blocks of text lines separated by identifiers like these:
[a]
.
.
.
[b]
.
.
[c]
and so on...
My program is reading this file with this BufferedReader:
BufferedReader reader = null;
try {
reader = new BufferedReader(new InputStreamReader(new FileInputStream("file.txt", Charset.forName("UTF-8"))); }
catch (FileNotFoundException e) { System.out.println("File not found!"); e.printStackTrace(); System.exit(0); }
This should find the first identifier:
String line = "";
while (!line.matches("^\\[.\\]$")) { line = reader.readLine(); }
But the first identifier is skipped!
Yet when I test the regex "manually", it works:
String line = "[a]";
if (line.matches("^\\[.\\]$")) { System.out.println("Regex matches"); }
It may be some trivial problem, but I am totally stuck at this point!
Thanks in advance for any reply!
EDIT:
Well, I changed the encoding of the text file to "ANSI" and it just started to work fine - OH MY GOD - why?! So there must be a problem with the reader - I will try to find it as soon as possible and edit my question.
So when the encoding of the text file is "UTF-8", the regex doesn't match the first line of the file, which is "[a]", but it does match the next identifier a few lines below. What is wrong?
EDIT 2:
LOL, I can't trust Windows Notepad anymore - I had saved that file with it... and a few moments ago I saved the file using the PSPad editor and now it works fine!
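A likely explanation for EDIT 2, offered as an assumption since the original file is not available: Notepad used to save UTF-8 files with a byte order mark (bytes EF BB BF) at the start, and Java's UTF-8 decoder passes it through as the invisible character U+FEFF. The first line is then U+FEFF followed by "[a]", which the anchored regex rejects. The sketch below reproduces this with in-memory bytes; the class and the stripBom helper are hypothetical names, not part of the question's code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class BomRegexDemo {
    // Hypothetical helper: drops a leading U+FEFF (byte order mark) if present.
    static String stripBom(String line) {
        return (line != null && line.startsWith("\uFEFF")) ? line.substring(1) : line;
    }

    // Decodes the bytes as UTF-8 and returns the first line.
    static String firstLine(byte[] utf8Bytes) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF is the UTF-8 byte order mark, followed by "[a]" and a newline.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '[', 'a', ']', '\n'};
        String line = firstLine(withBom);
        System.out.println(line.matches("^\\[.\\]$"));           // false: the line starts with U+FEFF
        System.out.println(stripBom(line).matches("^\\[.\\]$")); // true once the mark is removed
    }
}
```

Saving the file without a BOM, as PSPad apparently did, avoids the problem at the source.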

First of all, FileInputStream does not have a constructor that takes a file and a character set (see the JDK Javadoc). Maybe there is a typo in your code?
new FileInputStream("file.txt", Charset.forName("UTF-8"))
So is this a different implementation of FileInputStream?
Second, I tried this little test with the closing bracket moved:
BufferedReader reader;
reader = new BufferedReader(new InputStreamReader(
new FileInputStream("target/classes/test.txt"),
Charset.forName("UTF-8")));
String line = "";
while (!line.matches("^\\[.\\]$")) { line = reader.readLine(); }
System.out.println(line);
And it just worked and printed:
[a]

Related

ISO-8859-1 to UTF-8 in Java (runescape API)

I am trying to make a Discord bot that gets information from the RuneScape API and returns information about the user. The issue I have is when a username contains a space.
The RuneScape API returns a file in ISO-8859-1 and I try to convert it to UTF-8.
Two examples from the file: lil Jimmy and lil jessica.
The loop finds a match for jessica, but not for Jimmy.
The code for getting and reading the file:
InputStream input = null;
InputStreamReader inputReader = null;
BufferedReader reader = null;
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
input = url.openConnection().getInputStream();
inputReader = new InputStreamReader(input, "ISO-8859-1");
reader = new BufferedReader(inputReader);
String line;
while ((line = reader.readLine()) != null) {
String[] parts = line.split(",");
parts[0] = new String(parts[0].getBytes("UTF-8"), "ISO-8859-1");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
}
Does anyone know what I'm doing wrong? Thank you in advance for taking the time to help!
Edit 1: I've added the "ISO-8859-1" to inputReader as suggested in the answers. The next step is to replace the non-breaking spaces with regular spaces.
Edit 2: The non-breaking space can be handled with:
parts[0] = parts[0].replaceAll("\u00a0","aaaaaaaaa");
parts[0] = parts[0].replaceAll("\u00C2","bbbbbbbbb");
parts[0] = parts[0].replaceAll("bbbbbbbbbaaaaaaaaa", " ");
The aaaaaaaaa placeholder marks the non-breaking space and the bbbbbbbbb placeholder marks the Â (U+00C2) that appears in front of it; the last line replaces the pair with a regular space.
Thanks everyone for helping me out!

If you want to ensure that you're reading the data correctly, use:
inputReader = new InputStreamReader(input, "ISO-8859-1");
After that, I'm not sure why you're trying to convert to UTF-8, since you're just using the text as Strings from that point on. A string itself doesn't have an encoding. (Well, in a certain sense a Java string is like UTF-16 in its internal representation, but that's a whole other can of worms you don't need to worry about here.)

First, you are not providing the charset to your InputStreamReader, which causes it to use the default charset instead of the one it should be using. Then you are doing crazy stuff to try to fix it that you shouldn't have to do and that won't work properly.
Also, you are not closing the opened stream; you should be using try-with-resources.
It should probably look more like this:
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream(), StandardCharsets.ISO_8859_1))) {
String line;
while ((line = reader.readLine()) != null) {
String[] parts = line.split(",");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
}
}

Looking at the downloaded text file:
The whitespace for "lil jessica" is a regular space (U+0020), the one for "lil Jimmy" (and most of the others as well) is a non-breaking space (U+00A0).
If you don't care whether a space is breaking or non-breaking, the easiest approach is probably to replace it with a regular space in your input string, and to drop the getBytes("UTF-8") / "ISO-8859-1" round trip, which only puts a stray Â in front of the non-breaking space. Something like:
parts[0] = parts[0].replaceAll("\u00a0"," ");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
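To make the point above concrete, here is a small self-contained sketch. It uses a hand-built byte array in place of the downloaded file (an assumption made so the example runs offline), and normalizeSpaces is a hypothetical helper name. It also shows where the Â from the question's Edit 2 comes from: U+00A0 is encoded in UTF-8 as the two bytes C2 A0, and decoding those as ISO-8859-1 yields Â followed by the non-breaking space.

```java
import java.nio.charset.StandardCharsets;

final class NbspDemo {
    // Hypothetical helper: turns non-breaking spaces (U+00A0) into regular spaces.
    static String normalizeSpaces(String s) {
        return s.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // "lil Jimmy" as ISO-8859-1 bytes with 0xA0 (non-breaking space) between the words.
        byte[] raw = {'l', 'i', 'l', (byte) 0xA0, 'J', 'i', 'm', 'm', 'y'};
        String name = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println(name.equals("lil Jimmy"));                  // false
        System.out.println(normalizeSpaces(name).equals("lil Jimmy")); // true

        // The round trip from the question inserts U+00C2 in front of the non-breaking space.
        String roundTrip = new String(name.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(roundTrip.charAt(3) == '\u00C2');           // true
    }
}
```

Once the reader decodes the stream with the right charset, the strings need no further conversion; only the space character itself has to be normalized.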

Android's BreakIterator considers line breaks as sentence delimiters

I have a Unix text file that I want to read in my Android app and split into sentences. However, I noticed that BreakIterator treats some line break characters as sentence delimiters.
I use the following code to read the file and split it into sentences (only the first sentence is output, for presentation purposes):
File file = new File...
String text = "";
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
try {
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
String line;
StringBuilder stringBuilder = new StringBuilder();
while ((line = bufferedReader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append('\n');
}
inputStream.close();
text = stringBuilder.toString();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
sentenceIterator.setText(text);
int end = sentenceIterator.next();
System.out.println(end);
System.out.println(text.substring(0, end));
But if I compile and run the code from Eclipse as a desktop app, the text is split correctly. I don't understand why it doesn't do the same in the Android app.
I tried converting the text file to DOS format, and I even tried reading the file while preserving the original line breaks:
Pattern pat = Pattern.compile(".*\\R|.+\\z");
StringBuilder stringBuilder = new StringBuilder();
try (Scanner in = new Scanner(file, "UTF-8")) {
String line;
while ((line = in.findWithinHorizon(pat, 0)) != null) {
stringBuilder.append(line);
}
text = stringBuilder.toString();
sentenceIterator.setText(text);
int end = sentenceIterator.next();
System.out.println(end);
System.out.println(text.substring(0, end));
}
but without success. Any ideas?
You can download an excerpt from the file (unix format) here: http://dropmefiles.com/TZgBp
I've just noticed that it can be reproduced without download of this file. Just create a string that has line breaks inside sentences (e.g. "Hello, \nworld!") and run an instrumented test. If BreakIterator is used in a usual test then it splits correctly.
I expect 2 sentences:
sentence 1:
Foreword
IF a colleague were to say to you, Spouse of me this night today
manufactures the unusual meal in a home.
sentence 2:
You will join?
Yes, they don't look great but at least you know why it is so (sentence delimiters are ?. etc.). But if the code runs on Android it creates a sentence even from
Foreword
for some reason...
I'm not sure whether this is a bug or whether there is a workaround. But in my eyes it makes the Android version of BreakIterator useless as a sentence splitter, since it is normal for sentences in books to span multiple lines.
In all the experiments I've used the same import java.text.BreakIterator;

This is not really an answer but it might give you some insights.
It is not a file encoding issue: I tried it this way and got the same faulty behaviour.
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
String text = "Foreword\nIf a colleague were to say to you, Spouse of me this night today manufactures the unusual meal in a home. You will join?";
sentenceIterator.setText(text);
Android does not use the same Java implementation as your computer.
I noticed this when I printed the class of the sentenceIterator object:
sentenceIterator.getClass()
I have different classes when running with IntelliJ and when running on Android:
Running with IntelliJ:
sun.util.locale.provider.RuleBasedBreakIterator
Running on Android:
java.text.RuleBasedBreakIterator
sun.util.locale.provider.RuleBasedBreakIterator has the behaviour you want.
I don't know how to get Android to use the good RuleBasedBreakIterator class. I don't even know if it is possible.
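One possible workaround, sketched under the assumption that the Android implementation only misbehaves at line break characters (this has not been verified on a device): collapse each line break into a single space before handing the text to BreakIterator. The class and method names are hypothetical, and the approach also discards paragraph breaks, which may or may not be acceptable.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitter {
    // Replaces line breaks (and the whitespace around them) with one space,
    // then splits the flattened text into sentences.
    static List<String> split(String text) {
        String flat = text.replaceAll("\\s*[\\r\\n]+\\s*", " ");
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(flat);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(flat.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // The line break sits inside the sentence, as in the question's small reproduction.
        for (String sentence : split("Hello, \nworld!")) {
            System.out.println(sentence);
        }
    }
}
```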

Java charset - How to get correct input from System.in?

This is my first post here.
Well, I'm building a simple app for messaging through the console (cmd and terminal), just for learning, but I've got a problem while reading and writing the text with a charset.
Here is my initial code for sending a message; Main.CHARSET was set to UTF-8:
Scanner teclado = new Scanner(System.in,Main.CHARSET);
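BufferedWriter saida = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(cliente.getOutputStream()), Main.CHARSET));
saida.write(nick + " conectado!");
saida.newLine(); // the receiver uses readLine(), so each message needs a line terminator
saida.flush();
while (teclado.hasNextLine()) {
String s = teclado.nextLine(); // the line typed by the user
saida.write(nick + ": " + s);
saida.newLine();
saida.flush();
}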
And the receiving code:
try (BufferedReader br = new BufferedReader(new InputStreamReader(servidor,Main.CHARSET))){
String s;
while ((s = br.readLine()) != null) {
System.out.println(s);
}
}
When I send "olá" or anything like "ÁàçÇõÉ" (Brazilian Portuguese), I get just blank spaces in the Windows cmd (not tested on Linux).
So I tested the following code:
Scanner s = new Scanner(System.in,Main.CHARSET);
System.out.println(s.nextLine());
And for the input "olá", it printed "ol ".
The question is: how do I read the console so that the input is read correctly, can be transmitted to another user, and is displayed correctly to them?

If you just want to output Portuguese to a text file, it is easy.
The only thing you have to care about is that it is displayed using UTF-8 encoding.
You can use a really simple approach like:
String text = "olá";
FileWriter fw = new FileWriter("hello.txt");
fw.write(text);
fw.close();
Then open hello.txt with Notepad or any text tool that supports UTF-8,
or change your tool's default encoding to UTF-8.
If you want to show it on the console, I think pvg already answered you.
OK, it seems you are still confused about it.
Here is some simple code you can try:
Scanner userInput = new Scanner(System.in); // type olá
String text = userInput.next();
System.out.println((int) text.charAt(2)); // you will see that the output is 63
char word = 'á'; // this character converted to int is 225
int a = 225;
System.out.println((int) word); // prints 225
System.out.println((char) a); // prints á
So, what is the conclusion?
If you type Portuguese into the console and then read it, you get a completely different character (63 instead of 225), not gibberish, so the character is already wrong at the moment it is read.
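The underlying issue is that the bytes arriving on System.in are encoded in the console's code page, which on Windows cmd is usually not UTF-8. The sketch below illustrates the effect with in-memory bytes; ISO-8859-1 stands in for the console code page purely as an assumption for the demo (the real code page may differ), and the class name is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConsoleCharsetDemo {
    // Decodes raw input bytes with the given charset, as a Scanner or reader would.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "olá" as a single-byte code page sends it: 0xE1 is 'á' in ISO-8859-1.
        byte[] typed = {'o', 'l', (byte) 0xE1};

        // Decoding with the charset the bytes were written in recovers the text.
        System.out.println(decode(typed, StandardCharsets.ISO_8859_1).equals("ol\u00E1")); // true

        // Decoding the same bytes as UTF-8 does not: 0xE1 alone is not valid UTF-8.
        System.out.println(decode(typed, StandardCharsets.UTF_8).equals("ol\u00E1"));      // false
    }
}
```

In the chat program this suggests reading System.in with the console's charset and keeping UTF-8 only for the socket streams. On Java 17 and later, System.console().charset() reports the console charset when a console is attached.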

Junk characters while reading text file in java

I have a Java program that calls a Windows .bat file, which does some processing and generates an output file.
Process p = Runtime.getRuntime().exec("cmd /c "+filename);
Then I read the file with the following program (filexists() is a function that checks whether the file exists). The output file contains only a single line.
if (filexists("output.txt"))
{ String FileLine;
FileInputStream fstream = new FileInputStream("output.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
FileLine = br.readLine();
br.close(); // closing the reader also closes fstream
}
The variable FileLine contains 3 junk characters at the start. I also checked a few other files in the program, and none of them has this issue; the only difference is that this file is created via the Runtime call.
ï»¿9087.
As you can see, three junk characters appear in the output. When I open the file with Notepad++, I cannot see those junk characters.
Please suggest a fix.

This is happening because you did not specify the file encoding when creating your InputStreamReader. Assuming your file is UTF-8 encoded, you need to do something like this:
new InputStreamReader(new FileInputStream("output.txt"), "UTF-8");
Change the encoding to match the encoding of your file.

That looks like the byte order mark for UTF-8 encoding. See https://en.wikipedia.org/wiki/Byte_order_mark
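The three junk characters are what the BOM bytes EF BB BF look like when decoded with a single-byte charset such as ISO-8859-1 or windows-1252. Decoding as UTF-8 turns them into one invisible U+FEFF character, which Java does not remove on its own. A minimal sketch of a reader that skips it follows; the class and method names are hypothetical, and the input is an in-memory stand-in for output.txt.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class BomSkippingReader {
    // Opens a UTF-8 reader and consumes a leading U+FEFF if there is one.
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);
        if (reader.read() != '\uFEFF') {
            reader.reset(); // no BOM: rewind so the first character is not lost
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // BOM followed by the single line from the question.
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '9', '0', '8', '7', '.'};

        // Decoded with a single-byte charset, the BOM shows up as three junk characters.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).length()); // 8

        try (BufferedReader reader = open(new ByteArrayInputStream(bytes))) {
            System.out.println(reader.readLine()); // 9087.
        }
    }
}
```

For a real file, pass new FileInputStream("output.txt") to open(...) instead of the byte array.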

Maybe it's an issue with the file encoding, though I am not sure.
Can you please try the following piece of code and see if it works for you:
BufferedReader in = new BufferedReader(
new InputStreamReader( new FileInputStream("output.txt"), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
System.out.println(str);
}

Replace every quotation in a file by an escaping quotation with java

I am trying to edit a file with Java.
I would like to escape every quotation mark " in my file with \"
I tried it like this (thanks to another solution on Stack Overflow, whose code I could copy):
public void replaceInFile(File file) throws IOException {
File tempFile = new File("twittergeoUpdate.csv");
FileWriter fw = new FileWriter(tempFile);
Reader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
while (br.ready()) {
fw.write(br.readLine().replaceAll("\"", "\\\"") + "\n");
}
fw.close();
br.close();
fr.close();
}
I was too fast... It doesn't work for me. The quotation marks just stay untouched in my file. Any ideas?

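\\\" only escapes the " (double quote) for the Java compiler; you have to escape the backslash as well, because replaceAll treats a backslash in the replacement string specially. Thus you need 5 backslashes: \\\\\"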
s.replaceAll("\"", "\\\\\"")
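A short sketch of why the count matters, with hypothetical method names. In replaceAll the replacement string is not literal: a backslash escapes the next character, so the three-backslash version produces a plain quote again. String.replace and Matcher.quoteReplacement are two ways to avoid counting backslashes by hand.

```java
import java.util.regex.Matcher;

final class QuoteReplaceDemo {
    // Three backslashes: the replacement is \" which replaceAll reads as an escaped quote.
    static String threeBackslashes(String s) {
        return s.replaceAll("\"", "\\\"");
    }

    // Five backslashes: the replacement is \\" which yields a literal backslash plus quote.
    static String fiveBackslashes(String s) {
        return s.replaceAll("\"", "\\\\\"");
    }

    // String.replace takes both arguments literally, so no regex escaping is needed.
    static String viaReplace(String s) {
        return s.replace("\"", "\\\"");
    }

    // quoteReplacement escapes the replacement for use with replaceAll.
    static String viaQuoteReplacement(String s) {
        return s.replaceAll("\"", Matcher.quoteReplacement("\\\""));
    }

    public static void main(String[] args) {
        String input = "say \"hi\"";
        System.out.println(threeBackslashes(input));    // say "hi"
        System.out.println(fiveBackslashes(input));     // say \"hi\"
        System.out.println(viaReplace(input));          // say \"hi\"
        System.out.println(viaQuoteReplacement(input)); // say \"hi\"
    }
}
```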

You could use StringEscapeUtils#escapeJava() from the Apache commons-lang package.
Like this:
org.apache.commons.lang.StringEscapeUtils.escapeJava(<yourStringLiteralHere>)
From javadoc:
StringEscapeUtils#escapeJava() escapes the characters in a String using Java String rules.
Deals correctly with quotes and control-chars (tab, backslash, cr, ff, etc.)
So a tab becomes the characters '\' and 't'.
The only difference between Java strings and JavaScript strings is that in JavaScript, a single quote must be escaped.
Example:
input string: He didn't say, "Stop!"
output string: He didn't say, \"Stop!\"

I coded:
public void replaceInFile(File file) throws IOException {
File tempFile = new File("twittergeoUpdate.csv");
FileWriter fw = new FileWriter(tempFile);
Reader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
while (br.ready()) {
fw.write(br.readLine().replaceAll("\"", "\\\\\"") + "\n");
}
fw.close();
br.close();
fr.close();
}//replaceInFile
The correct replacement string is \\\\\" (5 backslashes, not 3)

The basic problem is that you cannot write to a file you are reading from and not expect it to change. In your case, the first thing FileWriter does is truncate the file. I have seen examples where the reader still manages to read something but it is corrupted.
You have to write to a temporary file, close both files and when finished replace (using delete and rename) your original file with the temporary one.
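The temporary-file approach described above could look roughly like this. It is a sketch using java.nio.file rather than the FileReader/FileWriter classes from the question; the class and method names are hypothetical, and UTF-8 is assumed as the file encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class QuoteEscaper {
    // Writes an escaped copy to a temporary file next to the original,
    // then replaces the original once both files are closed.
    static void escapeQuotesInPlace(Path file) throws IOException {
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "escape", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line.replace("\"", "\\\"")); // literal replace, no regex escaping
                out.newLine();
            }
        }
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Calling QuoteEscaper.escapeQuotesInPlace(Paths.get("twittergeoUpdate.csv")) would rewrite that file; note that line endings are normalized to the platform default.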
