Android's BreakIterator considers line breaks as sentence delimiters

I have a Unix text file that I want to read in my Android app and split into sentences. However, I noticed that BreakIterator treats some line break characters as sentence delimiters.
I use the following code to read the file and split it into sentences (only the first sentence is printed, for brevity):
File file = new File...
String text = "";
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
try {
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
String line;
StringBuilder stringBuilder = new StringBuilder();
while ((line = bufferedReader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append('\n');
}
inputStream.close();
text = stringBuilder.toString();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
sentenceIterator.setText(text);
int end = sentenceIterator.next();
System.out.println(end);
System.out.println(text.substring(0, end));
But if I compile and run the same code from Eclipse as a desktop app, the text is split correctly. I don't understand why it behaves differently in the Android app.
I tried converting the text file to DOS format, and I even tried reading the file while preserving the original line breaks:
Pattern pat = Pattern.compile(".*\\R|.+\\z");
StringBuilder stringBuilder = new StringBuilder();
try (Scanner in = new Scanner(file, "UTF-8")) {
String line;
while ((line = in.findWithinHorizon(pat, 0)) != null) {
stringBuilder.append(line);
}
text = stringBuilder.toString();
sentenceIterator.setText(text);
int end = sentenceIterator.next();
System.out.println(end);
System.out.println(text.substring(0, end));
}
but without success. Any ideas?
You can download an excerpt from the file (unix format) here: http://dropmefiles.com/TZgBp
I've just noticed that it can be reproduced without downloading this file. Just create a string that has line breaks inside sentences (e.g. "Hello, \nworld!") and run an instrumented test. If BreakIterator is used in a regular (non-instrumented) test, it splits correctly.
I expect 2 sentences:
sentence 1:
Foreword
IF a colleague were to say to you, Spouse of me this night today
manufactures the unusual meal in a home.
sentence 2:
You will join?
Yes, they don't look great, but at least you know why they are split that way (the sentence delimiters are ?, . etc.). But when the code runs on Android, it creates a sentence even from
Foreword
for some reason...
I'm not sure whether this is a bug or whether there is a workaround. But in my eyes it makes the Android version of BreakIterator useless as a sentence splitter, since it is normal for sentences in books to span multiple lines.
In all the experiments I've used the same import java.text.BreakIterator;

This is not really an answer but it might give you some insights.
It is not a file encoding issue: I tried it this way and got the same faulty behaviour.
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
String text = "Foreword\nIf a colleague were to say to you, Spouse of me this night today manufactures the unusual meal in a home. You will join?";
sentenceIterator.setText(text);
Android does not use the same Java implementation as your computer.
I noticed this when I printed out the class of the sentenceIterator object:
sentenceIterator.getClass()
I get different classes when running with IntelliJ and when running on Android:
Running with IntelliJ:
sun.util.locale.provider.RuleBasedBreakIterator
Running on Android:
java.text.RuleBasedBreakIterator
sun.util.locale.provider.RuleBasedBreakIterator has the behaviour you want.
I don't know how to get Android to use the good RuleBasedBreakIterator class. I don't even know if it is possible.
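One possible workaround, not confirmed anywhere in this thread, is to remove the soft line breaks yourself before handing the text to BreakIterator. The sketch below assumes that a single line break inside a paragraph is just wrapping and that a blank line separates paragraphs; the names SentenceUnwrapDemo, unwrapLines and splitSentences are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceUnwrapDemo {

    // Assumption: a lone line break is soft wrapping; a blank line is a paragraph break.
    static String unwrapLines(String text) {
        String unix = text.replaceAll("\\r\\n?", "\n");     // normalize DOS/old Mac line endings
        return unix.replaceAll("(?<!\\n)\\n(?!\\n)", " ");  // lone '\n' becomes a space
    }

    // Split the unwrapped text into trimmed sentences.
    static List<String> splitSentences(String text) {
        String flat = unwrapLines(text);
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(flat);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = flat.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Foreword\nIF a colleague were to say to you, Spouse of me this night today\n"
                + "manufactures the unusual meal in a home. You will join?";
        for (String s : splitSentences(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Whether this is acceptable depends on the text: headings such as "Foreword" get merged into the following sentence, which matches the two sentences the question expects.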

Related

ISO-8859-1 to UTF-8 in Java (runescape API)

I am trying to make a Discord bot which gets information from the RuneScape API and returns information about the user. The issue I have is when a username contains a space.
The RuneScape API returns a file in ISO-8859-1, and I try to convert it to UTF-8.
Two examples from the file: lil Jimmy and lil jessica.
The loop finds a match for jessica, but not for Jimmy.
The code for getting and reading the file:
InputStream input = null;
InputStreamReader inputReader = null;
BufferedReader reader = null;
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
input = url.openConnection().getInputStream();
inputReader = new InputStreamReader(input, "ISO-8859-1");
reader = new BufferedReader(inputReader);
String line;
while ((line = reader.readLine()) != null) {
String[] parts = line.split(",");
parts[0] = new String(parts[0].getBytes("UTF-8"), "ISO-8859-1");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
Does anyone know what I'm doing wrong? Thank you in advance for taking the time to help!
Edit 1: I've added "ISO-8859-1" to inputReader as suggested in the answers. The next step is to replace the non-breaking whitespace with regular white space.
Edit 2: The non-breaking whitespace can be handled with:
parts[0] = parts[0].replaceAll("\u00a0","aaaaaaaaa");
parts[0] = parts[0].replaceAll("\u00C2","bbbbbbbbb");
parts[0] = parts[0].replaceAll("bbbbbbbbbaaaaaaaaa", " ");
The aaaaaaaaa placeholder marks the non-breaking space and the bbbbbbbbb placeholder marks the Â (U+00C2) that appears in front of it; the last line replaces the pair with a single regular space.
Thanks everyone for helping me out!

If you want to ensure that you're reading the data correctly, use:
inputReader = new InputStreamReader(input, "ISO-8859-1");
After that, I'm not sure why you're trying to convert to UTF-8, since you're just using the text as Strings from that point on. A string itself doesn't have an encoding. (Well, in a certain sense a Java string is like UTF-16 in its internal representation, but that's a whole other can of worms you don't need to worry about here.)
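To make the point about Strings and encodings concrete, here is a small sketch. The byte array is a hypothetical stand-in for one name from the ISO-8859-1 response (with a non-breaking space, byte 0xA0, between the words). It shows that decoding once with the right charset is enough, and what the extra getBytes("UTF-8") round trip from the question does to the text. DecodeOnceDemo, decode and roundTrip are illustrative names.

```java
import java.nio.charset.StandardCharsets;

final class DecodeOnceDemo {

    // Decode raw ISO-8859-1 bytes into a String; this is the only place a charset is needed.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // The round trip from the question: re-encode as UTF-8, then decode as ISO-8859-1 again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "lil", a non-breaking space (0xA0), "Jimmy"
        byte[] raw = {'l', 'i', 'l', (byte) 0xA0, 'J', 'i', 'm', 'm', 'y'};
        String name = decode(raw);
        // Code point of the space-like character between the two words
        System.out.println((int) name.charAt(3));
        // Number of extra characters the round trip adds to the name
        System.out.println(roundTrip(name).length() - name.length());
    }
}
```

The round trip turns the single U+00A0 into the two characters U+00C2 U+00A0, which is where the stray Â described in Edit 2 comes from.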

First, you are not providing the charset to your InputStreamReader, which causes it to use the default charset instead of the one it should be using, and then you are doing crazy stuff to try to fix it that you shouldn't have to do and that won't work properly.
Also, you are not closing the opened stream; you should be using try-with-resources.
It should probably look more like this:
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream(), StandardCharsets.ISO_8859_1))) {
String line;
while ((line = reader.readLine()) != null) {
String[] parts = line.split(",");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
}
}

Looking at the downloaded text file:
The whitespace for "lil jessica" is a regular space (U+0020), the one for "lil Jimmy" (and most of the others as well) is a non-breaking space (U+00A0).
If you don't care for breaking or non-breaking, the easiest approach is probably to replace it with a regular white space in your input string. Something like:
parts[0] = new String(parts[0].getBytes("UTF-8"), "ISO-8859-1");
parts[0] = parts[0].replaceAll("\u00a0"," ");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
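A minimal sketch of that replacement, assuming the stream was already decoded as ISO-8859-1 so that no extra getBytes conversion is needed. NameNormalizer and normalizeSpaces are hypothetical names. It uses String.replace with char arguments, which avoids regex handling altogether.

```java
final class NameNormalizer {

    // Replace every non-breaking space (U+00A0) with a regular space (U+0020).
    static String normalizeSpaces(String name) {
        return name.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        String fromApi = "lil\u00A0Jimmy"; // hypothetical value as decoded from the response
        if (normalizeSpaces(fromApi).equals("lil Jimmy")) {
            System.out.println("lil Jimmy found");
        }
    }
}
```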

Java charset - How to get correct input from System.in?

This is my first post here.
I'm building a simple console messaging app (cmd and terminal), just for learning, but I've got a problem reading and writing text with a charset.
Here is my initial code for sending messages; Main.CHARSET is set to UTF-8:
Scanner teclado = new Scanner(System.in,Main.CHARSET);
BufferedWriter saida = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(cliente.getOutputStream()),Main.CHARSET));
saida.write(nick + " conectado!");
saida.flush();
while (teclado.hasNextLine()) {
String s = teclado.nextLine(); // read the line that was typed
saida.write(nick +": "+ s);
saida.flush();
}
And the receiving code:
try (BufferedReader br = new BufferedReader(new InputStreamReader(servidor,Main.CHARSET))){
String s;
while ((s = br.readLine()) != null) {
System.out.println(s);
}
}
When I send "olá" or anything like "ÁàçÇõÉ" (Brazilian Portuguese), I get just blank spaces in Windows cmd (not tested on Linux).
So I tested the following code:
Scanner s = new Scanner(System.in,Main.CHARSET);
System.out.println(s.nextLine());
For the input "olá", it printed "ol ".
The question is: how do I read the console so that the input is read correctly, can be transmitted to another user, and is displayed correctly for them?

If you just want to write Portuguese text to a text file, it is easy.
The only thing you have to take care of is that it is displayed using UTF-8 encoding.
You can use a really simple approach like
String text = "olá";
FileWriter fw = new FileWriter("hello.txt");
fw.write(text);
fw.close();
Then open hello.txt in Notepad or any text tool that supports UTF-8,
or change your tool's default encoding to UTF-8.
If you want to show it on the console, I think pvg already answered you.
OK, it seems you are still confused about it.
Here is some simple code you can try.
Scanner userInput = new Scanner(System.in); // type olá
String text = userInput.next();
System.out.println((int)text.charAt(2)); // you will see that the output is 63
char word = 'á'; // this character converted to int is 225
int a = 225;
System.out.println((int)word); // output 225
System.out.println((char)a); // output á
So, what is the conclusion?
If you type Portuguese on the console and then read it, you get a completely different character (63 is '?'), not a garbled one.
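The underlying issue is that the bytes coming from the console have to be decoded with the charset the console actually uses, which on Windows cmd is often not UTF-8. The sketch below simulates this with a byte array instead of System.in. It assumes, purely for illustration, a console that sends ISO-8859-1 bytes (where á is the single byte 0xE1). ConsoleCharsetDemo and readFirstLine are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConsoleCharsetDemo {

    // Read one line from raw bytes, decoding them with the given charset.
    static String readFirstLine(byte[] rawInput, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(rawInput), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated console input "olá" plus a newline, as ISO-8859-1 bytes (assumption).
        byte[] typed = {'o', 'l', (byte) 0xE1, '\n'};
        String matching = readFirstLine(typed, StandardCharsets.ISO_8859_1);
        String mismatched = readFirstLine(typed, StandardCharsets.UTF_8);
        // Compare each decoding against the text that was typed
        System.out.println("matching charset ok:   " + matching.equals("ol\u00E1"));
        System.out.println("mismatched charset ok: " + mismatched.equals("ol\u00E1"));
    }
}
```

With real console input you would pass System.in and the console's actual charset to InputStreamReader (or to the Scanner constructor used in the question); which charset that is depends on the terminal.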

Scanner and Buffered reader not showing double quotation marks Java Android

I'm trying to load from a .txt file. Like most .txt files, it's UTF-8 encoded, so it shows double-quotation mark characters when I load it inside of eclipse.
The problem is, when I load my text file into bufferedreader (also set to UTF-8 encoding), it converts double quotation marks and a few other characters into question mark boxes on my device.
I can't figure out what could be the problem, searches here and on Google are all talking about Arabic characters. Please help.
edit: ... updating question... one minute
edit2: I'm displaying them inside a TextView.
The following is from a method. Scanner wasn't working either so I used this:
InputStream in;
in = getResources().openRawResource(R.raw.text);
BufferedReader br = new BufferedReader(new InputStreamReader(in,Charset.forName("UTF-8")));
ArrayList<String> letters = new ArrayList<String>(25);
try {
String line="";
while((line = br.readLine()) != null){
String[] splited = line.split("\\s+");
int m = 0;
String word="";
while(m<splited.length){
m++; // analyze the word and do some other stuff here
letters.add(word);
}
}
in.close();
br.close();
} catch (IOException e) {
e.printStackTrace();
}
Here is where I display my text inside a handler:
final TextView txt = (TextView) rootView.findViewById(R.id.something);
// handler stuff, then inside the handler:
txt.setText(word, BufferType.SPANNABLE);
Spannable s = (Spannable)txt.getText();
s.setSpan(new ForegroundColorSpan(0xFFFFFFFF),2,3,Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
I removed the spannable, no dice.

To see what encoding your default Notepad saves your txt files as, go to File >> Save As; along the bottom or in some kind of options menu there should be something about character encoding.

Java - regex match in String vs. readline in file

I don't understand this strange behaviour of regex match in Java. I'm working in Eclipse...
I have a .txt file encoded in UTF-8, containing blocks of text lines separated by these identifiers:
[a]
.
.
.
[b]
.
.
[c]
and so on...
My program is reading this file with this BufferedReader:
BufferedReader reader = null;
try {
reader = new BufferedReader(new InputStreamReader(new FileInputStream("file.txt", Charset.forName("UTF-8"))); }
catch (FileNotFoundException e) { System.out.println("File not found!"); e.printStackTrace(); System.exit(0); }
This should find the first identifier:
String line = "";
while (!line.matches("^\\[.\\]$")) { line = reader.readLine(); }
But it is instantly skipped!
But when I try to test the regex "manually", it works:
String line = "[a]";
if (line.matches("^\\[.\\]$")) { System.out.println("Regex matches"); }
It is possible that this is some trivial problem, but I am totally stuck at this point!
Thanks in advance for any reply!
EDIT:
Well, I changed the encoding of the text file to "ANSI" and it just started to work fine - OH MY GOD - why?! So there must be a problem with the reader - I will try to find it as soon as possible and edit my question.
So when the encoding of the text file is "UTF-8", the regex doesn't match the first line of the text file, which is "[a]", but matches the next identifier a few lines below. What is wrong?
EDIT 2:
LOL I can't trust the Windows Notepad anymore - I had saved that file in it...and a few moments ago I saved that file using PSPad editor and now it works fine!

First of all, FileInputStream does not have a constructor taking a file and a character set (JavaDoc JDK). Maybe you have a typo in your code?
new FileInputStream("file.txt", Charset.forName("UTF-8"))
Or is this a different implementation of FileInputStream?
Second, I tried this little test with the brackets corrected:
BufferedReader reader;
reader = new BufferedReader(new InputStreamReader(
new FileInputStream("target/classes/test.txt"),
Charset.forName("UTF-8")));
String line = "";
while (!line.matches("^\\[.\\]$")) { line = reader.readLine(); }
System.out.println(line);
And it just worked and printed:
[a]
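One likely explanation for the Notepad difference, which the thread itself does not confirm, is a byte order mark: Notepad has traditionally written a BOM at the start of files saved as UTF-8, and Java's UTF-8 decoder passes it through as the character U+FEFF, so the first line read is not exactly "[a]". A small sketch of checking for and stripping it (BomDemo and stripBom are hypothetical names):

```java
final class BomDemo {

    // Remove a leading U+FEFF (byte order mark) if the decoder left one in the text.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        String firstLine = "\uFEFF[a]"; // simulated first line of a file saved with a BOM
        // Same regex as in the question, applied before and after stripping
        System.out.println("raw line matches:      " + firstLine.matches("^\\[.\\]$"));
        System.out.println("stripped line matches: " + stripBom(firstLine).matches("^\\[.\\]$"));
    }
}
```

Applying stripBom only to the first line returned by readLine() would be enough, since a BOM can only appear at the very start of the file.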

Skip parts while reading and writing a file in Android/Java

I'm trying to learn Java/Android and right now I'm doing some experiments with the replaceAll function. But I've found that with large text files the process gets sluggish so I was wondering if there is a way to skip the "useless" parts of a file to have a better performance. (Note: Just skip them, not delete them)
Note: I am not trying to "count lines" or "println" or "system.out", I'm just replacing strings and saving the changes in the same file.
Example
AAAA
CCCC- 9234802394819102948102948104981209381'238901'2309'129831'2381'2381'23081'23081'284091824098304982390482304981'20841'948023984129048'1489039842039481'204891'29031'923481290381'20391'294872385710239841'20391'20931'20853029573098341'290831'20893'12894093274019799919208310293810293810293810293810298'120931¿2093¿12039¿120931¿203912¿0391¿203912¿039¿12093¿12093¿12093¿12093¿12093¿1209312¿0390¿... DDDD
AAAA
CCCC- 9234802394819102948102948104981209381'238901'2309'129831'2381'2381'23081'23081'284091824098304982390482304981'20841'948023984129048'1489039842039481'204891'29031'923481290381'20391'294872385710239841'20391'20931'20853029573098341'290831'20893'12894093274019799919208310293810293810293810293810298'120931¿2093¿12039¿120931¿203912¿0391¿203912¿039¿12093¿12093¿12093¿12093¿12093¿1209312¿0390¿... DDDD
and so on....like a zillion times
I want to replace all "AAAA" with "BBBB", but there are large portions of data between the strings I am replacing. Also, these portions always begin with "CCCC" and end with "DDDD".
Here's the code I am using to replace the string.
File file = new File("my_file.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
String line = "", oldtext = "";
while((line = reader.readLine()) != null) {
oldtext += line + "\r\n";
}
reader.close();
// Replacing "AAAA" strings
String newtext= oldtext.replaceAll("AAAA", "BBBB");
FileWriter writer = new FileWriter("my_file.txt");
writer.write(newtext);
writer.close();
I think reading all lines is inefficient, especially when you won't be modifying these parts (and they represent 90% of the file).
Does anyone know a solution?

You are wasting a lot of time on this line --
oldtext += line + "\r\n";
In Java, String is immutable, which means you can't modify it. Therefore, when you do the concatenation, Java is actually making a complete copy of oldtext. So, for every line in your file, you are recopying every line that came before into your new String. Take a look at StringBuilder for a way to build a String while avoiding these copies.
However, in your case, you do not need the whole file in memory, because you can process line by line. By moving your replaceAll and write into your loop, you can operate on each line as you read it. This will keep the memory footprint of the routine down, because you are only keeping a single line in memory.
Note that since the FileWriter is opened before you read the input file, you need to have a different name for the output file. If you want to keep the same name, you can do a renameTo on the File after you close it.
File file = new File("my_file.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
FileWriter writer = new FileWriter("my_out_file.txt");
String line = "";
while((line = reader.readLine()) != null) {
// Replacing "AAAA" strings
String newtext= line.replaceAll("AAAA", "BBBB");
// readLine() strips the line terminator, so add it back
writer.write(newtext + "\r\n");
}
reader.close();
writer.close();
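Putting the two suggestions together (process line by line, then rename the output over the original), a sketch using java.nio.file might look like the following. It also leaves lines that start with CCCC untouched, which assumes that each bulk block sits on its own line as in the question's example. SkipReplaceDemo, rewrite and demo are hypothetical names, and on Android java.nio.file requires API level 26 or later.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

final class SkipReplaceDemo {

    // Stream the file line by line into a temp file, then move it over the original.
    static void rewrite(Path file) throws IOException {
        Path tmp = Files.createTempFile("rewrite", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: bulk data lines start with "CCCC" and never need replacing.
                writer.write(line.startsWith("CCCC") ? line : line.replace("AAAA", "BBBB"));
                writer.newLine(); // readLine() strips the terminator, so add one back
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    // Runs rewrite() on a small throwaway file and returns the resulting lines.
    static List<String> demo() {
        try {
            Path file = Files.createTempFile("my_file", ".txt");
            Files.write(file, Arrays.asList("AAAA", "CCCC- 9234 AAAA DDDD", "AAAA"),
                    StandardCharsets.UTF_8);
            rewrite(file);
            List<String> result = Files.readAllLines(file, StandardCharsets.UTF_8);
            Files.delete(file);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The skipped lines still have to be read, since a plain text file has no index that would allow jumping over them; what is saved is the search-and-replace work and the memory for holding the whole file.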
