Java : Skip Unicode characters while reading a file - java

I am reading a text file using the below code,
try (BufferedReader br = new BufferedReader(new FileReader(<file.txt>))) {
for (String line; (line = br.readLine()) != null;) {
//I want to skip a line with unicode character and continue next line
if(line.toLowerCase().startsWith("\\u")){
continue;
//This is not working because i get the character itself and not the text
}
}
}
The text file:
How to skip all the unicode characters while reading a file ?

You can skip all lines that contain non-ASCII characters:
if (!Charset.forName("US-ASCII").newEncoder().canEncode(line)) {
continue;
}
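A minimal sketch of this check in context, assuming the goal is to skip lines that contain non-ASCII characters and keep the rest (hence the negated canEncode test). The class name AsciiLineFilter and the method asciiLines are made up for illustration, and a StringReader stands in for the file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
class AsciiLineFilter {

    // Returns only the lines that consist entirely of US-ASCII characters.
    static List<String> asciiLines(Reader source) throws IOException {
        CharsetEncoder ascii = StandardCharsets.US_ASCII.newEncoder();
        List<String> kept = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            for (String line; (line = br.readLine()) != null; ) {
                if (!ascii.canEncode(line)) {
                    continue; // at least one character is not ASCII: skip the line
                }
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // The middle line contains U+00E9 and is dropped.
        String sample = "plain line\n\u00e9t\u00e9\nanother line\n";
        System.out.println(asciiLines(new StringReader(sample)));
    }
}
```

For a real file, pass a FileReader (or, better, a reader created with an explicit charset) instead of the StringReader.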

All characters in a String are Unicode: a String is a counted sequence of UTF-16 code units. By "Unicode" you presumably mean characters that are not also in some other, unspecified character set. For the sake of argument, let's say ASCII.
A regular expression can sometimes be the simplest expression of a pattern requirement:
if (!line.matches("\\p{ASCII}*")) continue;
That is, if the line does not consist solely of zero or more (that's what * means) ASCII characters, then continue.
(String.matches looks for a match on the whole string, so the pattern behaves like ^\p{ASCII}*$.)

Something like this might get you going:
for (char c : line.toCharArray()) {
if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN) {
// do something with this character
}
}
You could use that as a starting point to either discard each non-basic character, or discard the entire line if it contains a single non-basic character.
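Both variants could look roughly like the sketch below. The class and method names (BasicLatin, keepBasicLatin, isBasicLatin) are hypothetical, and the code works on UTF-16 chars, so the halves of a surrogate pair are simply treated as non-basic:

```java
// Hypothetical helper for illustration; not part of any library.
class BasicLatin {

    // Returns a copy of the line without any char outside the Basic Latin block (U+0000..U+007F).
    static String keepBasicLatin(String line) {
        StringBuilder sb = new StringBuilder(line.length());
        for (char c : line.toCharArray()) {
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // True if every char of the line is in the Basic Latin block.
    static boolean isBasicLatin(String line) {
        return keepBasicLatin(line).length() == line.length();
    }
}
```

In the reading loop, `if (!BasicLatin.isBasicLatin(line)) continue;` would discard the whole line, while `line = BasicLatin.keepBasicLatin(line);` would only drop the offending characters.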

Related

How to check if on the end of line is \n or \r or \r\n in JAVA

I need to check every character in a file and cast it to a byte. Unfortunately, Scanner gives no way to keep the terminating character(s) of a line...
I tried something like this:
Scanner in = new Scanner(new File(path));
List<Byte> byteList = new ArrayList<>();
while (in.hasNextLine()) {
String a = in.nextLine();
if (in.hasNextLine()) {
a = a + (char) (13);
}
for (char c : a.toCharArray()) {
byteList.add((byte) c);
}
}
byte[] bytes = new byte[byteList.size()];
for (int i = 0; i < byteList.size(); i++) {
bytes[i] = byteList.get(i);
}
return bytes;
}
Have you maybe any idea for the solution on this problem ?
I'll be grateful for your help.
You cannot do this with Scanner.nextLine() or BufferedReader.readLine() because both of these APIs consume the line separators.
You could conceivably do it using Scanner.next() with a custom separator regex that causes the line separators to be included in the tokens. (Hint: using a look-behind.)
However for what you are actually doing in the code, either a FileInputStream or a FileReader would be better.
This brings me to another thing.
What is this code supposed to do?
What it actually does is to convert Unicode code units into bytes by throwing away the top bits. That might make sense if the input charset was ASCII or (maybe) LATIN-1. But for anything else, it is probably going to mangle the text.
If you are trying read the file as (raw) bytes, simply use FileInputStream + BufferedInputStream. Then read / process the bytes directly. The line terminators won't require any special handling.
If you are trying to read the file as encoded characters in some charset and transliterate it to another one (e.g. ASCII), you should be writing to a FileWriter + BufferedWriter. Once again, line separator / terminator characters will be preserved ... and you can "normalize" them if you want to.
If you are doing something else ... well this is probably not the right way to do it. A List<Byte> is going to be inefficient and difficult to convert to something that other Java APIs can deal with directly.
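For the raw-bytes case above, a minimal sketch with FileInputStream + BufferedInputStream might look like this. The class name RawBytes and the method readAll are made up for illustration; every byte, including \r and \n, ends up in the result unchanged:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of any library.
class RawBytes {

    // Reads the whole file as bytes; line terminators need no special handling.
    static byte[] readAll(File file) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n); // copy exactly the bytes that were read
            }
            return out.toByteArray();
        }
    }
}
```

On Java 7 and later, java.nio.file.Files.readAllBytes(file.toPath()) does the same in a single call, and the resulting byte[] replaces the List<Byte> from the question.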
Read the whole file, including all line endings, in as a single string:
String fileStr = in.useDelimiter("\\A").next();
The regex \A matches only at the start of input, so the delimiter is never encountered after that point and the entire input stream is returned from next().
If your situation requires all line endings to be corrected to a specific line ending, despite whatever the file contains, do this:
fileStr = fileStr.replaceAll("\\R", "\n");
The regex \R matches all types of line endings.
Of course this can all be done as 1 line:
String fileStr = in.useDelimiter("\\A").next().replaceAll("\\R", "\n");
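A small self-contained sketch of this idiom, assuming Java 8 or later (\R is not available before that). The class name WholeInput and the method readNormalized are hypothetical, and the hasNext() guard is an addition: without it, next() throws NoSuchElementException on empty input:

```java
import java.util.Scanner;

// Hypothetical helper for illustration; not part of any library.
class WholeInput {

    // Reads everything the scanner has left and converts all line endings to \n.
    static String readNormalized(Scanner in) {
        in.useDelimiter("\\A"); // \A matches only at the very start of the input
        String all = in.hasNext() ? in.next() : ""; // empty input has no token
        return all.replaceAll("\\R", "\n"); // \R matches \r\n, \r, \n and other line breaks
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\rc\n";
        // Prints the three lines a, b and c, each terminated by \n.
        System.out.print(readNormalized(new Scanner(mixed)));
    }
}
```

For a file, construct the Scanner with `new Scanner(new File(path))` as in the question.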

reading text file java

I'm trying to read a text file(.txt) in java. I need to eventually put the text I extract word by word in a binary tree's nodes . If for example, I have the text: "Hi, I'm doing a test!", I would like to split it into "Hi" "I" "m" "doing" "a" "test", basically skipping all punctuation and empty spaces and considering a word to be a sequence of contiguous alphabet letters. I am so far able to extract the words and put them in an array for testing. However, if I have a completely empty line in my .txt file, the code will consider it as a word and return an empty space. Also, punctuation at the end of a line works but if there's a comma for example and then text, I will get an empty space as well ! Here is what I tried so far:
public static void main(String[] args) throws Exception
{
FileReader file = new FileReader("File.txt");
BufferedReader reader = new BufferedReader(file);
String text = "";
String line = reader.readLine();
while (line != null)
{
text += line;
line = reader.readLine();
}
System.out.println(text);
String textnospaces=text.replaceAll("\\s+", " ");
System.out.println(textnospaces);
String [] tokens = textnospaces.split("[\\W+]");
for(int i=0;i<=tokens.length-1;i++)
{
tokens[i]=tokens[i].toLowerCase();
System.out.println(tokens[i]);
}
}
Using the following text:
I can't, come see you. Today my friend is hard
s
I get the following output:
i
can
t
(extra space between "t" and "come")
come
see
you
(extra space again)
today
my
friend
is
hards
Any help would be appreciated ! Thanks
Use the trim() method of String. From the documentation (http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim%28%29):
"Returns a copy of the string, with leading and trailing whitespace omitted.
If this String object represents an empty character sequence, or the first and last characters of character sequence represented by this String object both have codes greater than '\u0020' (the space character), then a reference to this String object is returned.
Otherwise, if there is no character with a code greater than '\u0020' in the string, then a new String object representing an empty string is created and returned.
Otherwise, let k be the index of the first character in the string whose code is greater than '\u0020', and let m be the index of the last character in the string whose code is greater than '\u0020'. A new String object is created, representing the substring of this string that begins with the character at index k and ends with the character at index m-that is, the result of this.substring(k, m+1).
This method may be used to trim whitespace (as defined above) from the beginning and end of a string.
Returns:
A copy of this string with leading and trailing white space removed, or this string if it has no leading or trailing white space."
If you really are just looking for each contiguous sequence of letters, you can accomplish this quite simply with regex matching.
String patternString1 = "([a-zA-Z]+)";
String text = "I can't, come see you. Today my friend is hard";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
System.out.println("found: " + matcher.group(1));
}
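Applied to the question's input, the matching approach could be wrapped up as in the sketch below. The names WordExtractor and words are made up for illustration. Two assumptions are worth stating: the lines are joined with a separator (here \n) before matching, because concatenating them directly is what fuses "hard" and "s" into "hards" in the question's output, and Locale.ROOT is used so lower-casing does not depend on the default locale:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class WordExtractor {

    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Returns every run of ASCII letters, lower-cased, in order of appearance.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group().toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        // The two lines of the sample file, joined with a newline.
        String text = "I can't, come see you. Today my friend is hard\ns";
        System.out.println(words(text));
    }
}
```

Because find() only ever returns non-empty letter runs, empty lines and punctuation followed by a space never produce empty tokens, which is the problem that split("[\\W+]") has.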

Cast arbitrary escaped character to int

I have a method that, at the end, takes a character array (with one element), and returns the cast of that character:
char[] first = {'a'};
return (int)first[0];
However, sometimes I have character arrays with two elements, where the first is always a "\" (i.e. it is a character array that "contains" an escaped character):
char[] second = {'\\', 'n'};
I would like to return (int)'\n', but I do not know how to convert that array into a single escaped character. I am okay checking whether or not the array is of length 1 or 2, but I really don't want to have a long switch or if/else block to go through every possible escaped character.
How about making a HashMap from the escaped character to its character code? Like:
Map<Character, Integer> escapeMap = new HashMap<>();
escapeMap.put('n', 10);
Then make something like:
if (second[0] == '\\') {
return escapeMap.get(second[1]);
}
else
{
return (int)first[0];
}
You could use a map to store the escape sequences mappings to their corresponding characters. If you assume that the escape sequence will always be just one character with the code below 128, you could simplify the mappings to something like this:
char[] escaped = {..., '\n', ...'\t', ...}
where the character '\n' is on the (int)'n'-th position of the array.
Then you would find the escaped character just by escaped[(int)second[1]]. You just need to check the array bounds in case an invalid escape sequence is found.
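A fuller sketch of the map-based idea, assuming only Java's single-character escapes (\b, \t, \n, \f, \r, \", \' and \\) need to be supported. The class name EscapeDecoder and the method toCode are hypothetical, and throwing on an unknown escape is one possible choice, not something the question specifies:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any library.
class EscapeDecoder {

    // Maps the character after the backslash to the character it denotes.
    private static final Map<Character, Character> ESCAPES = new HashMap<>();
    static {
        ESCAPES.put('b', '\b');
        ESCAPES.put('t', '\t');
        ESCAPES.put('n', '\n');
        ESCAPES.put('f', '\f');
        ESCAPES.put('r', '\r');
        ESCAPES.put('"', '"');
        ESCAPES.put('\'', '\'');
        ESCAPES.put('\\', '\\');
    }

    // Accepts {'x'} or {'\\', 'x'} and returns the code of the resulting character.
    static int toCode(char[] chars) {
        if (chars.length == 1) {
            return chars[0];
        }
        if (chars.length == 2 && chars[0] == '\\') {
            Character decoded = ESCAPES.get(chars[1]);
            if (decoded != null) {
                return decoded.charValue();
            }
        }
        throw new IllegalArgumentException("Unsupported input: " + new String(chars));
    }
}
```

With this, toCode(new char[] {'a'}) yields 97 and toCode(new char[] {'\\', 'n'}) yields 10, without a switch or if/else chain per escape.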
Here is an ugly hack. This works for me, but appears to be unreliable, buggy and time-consuming (and hence don't use this in critical parts).
char[] second = {'\\','n'};
String s = new String(second);
//write the String to an OutputStream
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(baos));
writer.write("key=" + s);
writer.close();
//load the String using Properties
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
Properties prop = new Properties();
prop.load(bais);
baos.close();
bais.close();
//now get the character
char c = prop.getProperty("key").charAt(0);
System.out.println((int)c);
output: 10 (which is the output of System.out.println((int)'\n');)
When printing (int)'a', the decimal value of that character in the Unicode table is printed. For a, that is 97. http://unicode-table.com/en/
\n is identified as the (Uni)code LF, which means Line Feed or New Line. In the Unicode table that's decimal number 10, which indeed gets printed if you write: System.out.println((int)'\n');
The same goes for the escapes \b, \f, \r, \t, \', \" and \\, which have special meaning for the compiler and each denote a single character code, like LF for \n. Look them up if you want to know the details.
In that light, the simplest solution would be:
char[] second = {'\\', 'n'};
if (second.length > 1) {
System.out.println((int)'\n');
} else {
System.out.println((int)second[0]);
}
and that's only if \n is the only escape sequence you encounter.

Reading from InputStream until double quotation marks

Need help reading from InputStream to a list of bytes until quotation marks.
The problem is, InputStream reads bytes and I'm not sure how to stop it reading when it reaches quotation marks ... I thought about something like this:
public static List<Byte> getQuoted(InputStream in) throws IOException {
int c;
LinkedList<Byte> myList = new LinkedList<>();
try {
while ((in.read()) != "\"") { ?????
list.add(c)
.....
The while condition is a problem, of course the quotation marks are String while int is expected.
"\"" is a String. If you want just the character representation of ", use '"' instead.
Note that your code will not work as you expect if your file is not ASCII-encoded (and the behaviour will differ between character sets), although it does of course depend on what you expect.
If in ASCII, each character takes up a single byte in the file and InputStream::read() reads a single byte (thus a single ASCII character), so everything will work fine.
If in an encoding that uses more than 1 byte per character (e.g. UTF-16, or UTF-8 for non-ASCII characters), each read will return less than a whole character and your code will probably not work as expected.
Reader::read() (and using Character rather than Byte) is advised since it reads a character, not just a byte.
Also, you're missing an assignment:
while ((in.read()) != '"')
should be
while ((c = in.read()) != '"')
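Put together, a minimal sketch might look like the following. The class name QuoteReader and the method readUntilQuote are made up, a ByteArrayOutputStream is swapped in for the List<Byte>, and the -1 check is an addition not mentioned above: read() returns -1 at end of stream, so without it the loop never ends when the input contains no quotation mark. As noted, comparing a byte with '"' assumes an ASCII-compatible encoding:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
class QuoteReader {

    // Collects bytes up to, but not including, the first '"' or the end of the stream.
    static byte[] readUntilQuote(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int c;
        while ((c = in.read()) != -1 && c != '"') {
            out.write(c);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("abc\"def".getBytes(StandardCharsets.US_ASCII));
        // Prints abc
        System.out.println(new String(readUntilQuote(in), StandardCharsets.US_ASCII));
    }
}
```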

Java: Count empty lines in a text file/string

I am using the following code to count empty lines in Java, but this code returns a greater number of empty lines than there are.
int countEmptyLines(String s) {
int result=0;
Pattern regex = Pattern.compile("(?m)^\\s*$");
Matcher testMatcher = regex.matcher(s);
while (testMatcher.find())
{
result++;
}
return result;}
What am I doing wrong or is there a better way to do it?
Try this:
final BufferedReader br = new BufferedReader(new StringReader("hello\n\nworld\n"));
String line;
int empty = 0;
while ((line = br.readLine()) != null) {
if (line.trim().isEmpty()) {
empty++;
}
}
System.out.println(empty);
I found a way to fix my own regex while I was at lunch:
Pattern regex = Pattern.compile("(?m)^\\s*?$");
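The '?' makes the \s* reluctant: it matches as little whitespace as possible, so the match ends at the first position where '$' can match instead of running on across the following line terminators.
\s matches any whitespace character: a space, a tab, a carriage return, a line feed, a form feed or a vertical tab.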
The easiest way to do this is to count chains of successive EOL characters. I write EOL because you need to determine which character(s) denote the end of a line in your file. Under Windows, an end of line is a Carriage Return followed by a Linefeed; under Unix it is a single Linefeed, so for a file written under Unix your program will have to be adjusted.
Then, for every chain of successive end-of-line sequences, add the length of the chain minus 1 to a count. At the end, you will have the empty line count.
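The run-counting idea above could be sketched as follows. The class name EmptyLineCounter is hypothetical, and several choices here are assumptions rather than part of the answer: \r\n, a lone \r and a lone \n each count as one terminator (so no per-platform adjustment is needed), a run at the very start of the text counts in full because no line precedes it, and lines that contain only spaces or tabs are not counted as empty:

```java
// Hypothetical helper for illustration; not part of any library.
class EmptyLineCounter {

    static int countEmptyLines(String s) {
        int count = 0;
        int run = 0;            // terminators seen in the current chain
        boolean atStart = true; // no non-terminator character seen yet
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\r' || c == '\n') {
                if (c == '\r' && i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                    i++; // treat \r\n as a single terminator
                }
                run++;
            } else {
                count += emptyLinesIn(run, atStart);
                run = 0;
                atStart = false;
            }
        }
        return count + emptyLinesIn(run, atStart); // chain at the end of the text
    }

    // The first terminator of a chain ends a non-empty line, unless the chain starts the text.
    private static int emptyLinesIn(int run, boolean atStart) {
        if (run == 0) {
            return 0;
        }
        return atStart ? run : run - 1;
    }
}
```

For "hello\n\nworld\n" this gives 1, the same result as the readLine() loop shown earlier for that input.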
