Reading file to String in Java results in invisible characters

I'm having trouble reading a text file into a String in Java. I have a text file (created in Eclipse, if that matters) that contains a short amount of text, approximately 98 characters. Reading that file into a String via several methods results in a String that is much longer: 1621 characters. All but the relevant 98 are invisible in the debugger/console.
I've tried the following methods to load the String:
Apache commons-io:
FileUtils.readFileToString(new File(path));
FileUtils.readFileToString(new File(path), "UTF-8");
byte[] b = FileUtils.readFileToByteArray(new File(path));
new String(b, "UTF-8");
byte[] b = FileUtils.readFileToByteArray(new File(path));
Charset.defaultCharset().decode(ByteBuffer.wrap(b)).toString();
NIO:
new String(Files.readAllBytes(Paths.get(path)));
And so on.
Is there a method to strip away these control chars? Is there a way to read files to strings that doesn't have this issue?
As noted in the comments below, this behavior is due to a corrupted(?) file generated by Eclipse. I'd still be interested in hearing any strategies for trimming away control characters from Strings, though!

If you want to strip out all non-printable characters, try this
str = str.replaceAll("[^\\p{Graph}\n\r\t ]", "");
The regex matches all "invisible" characters, except ones we want to keep; in this case newline chars, tabs and spaces.
\p{Graph} is a POSIX character class for all printable/visible characters (in Java it covers only ASCII unless the UNICODE_CHARACTER_CLASS flag is set). To negate a POSIX character class we can use a capital P, i.e. \P{Graph} (all non-printable/invisible characters); however, we need to keep newlines, tabs and spaces, so we use [^\\p{Graph}\n\r\t ] instead.
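As a rough illustration of the pattern above, here is a minimal sketch. ControlCharStripper, strip and stripUnicodeAware are names made up for this example. The second method is an assumption about what you may want if the text contains non-ASCII letters: it compiles the same pattern with Pattern.UNICODE_CHARACTER_CLASS so that characters such as é count as visible.

```java
import java.util.regex.Pattern;

public class ControlCharStripper {
    // Anything that is not a visible ASCII character, newline, carriage return, tab or space.
    private static final Pattern ASCII_INVISIBLE =
            Pattern.compile("[^\\p{Graph}\\n\\r\\t ]");

    // Same pattern, but \p{Graph} follows the Unicode definition of a visible character.
    private static final Pattern UNICODE_INVISIBLE =
            Pattern.compile("[^\\p{Graph}\\n\\r\\t ]", Pattern.UNICODE_CHARACTER_CLASS);

    static String strip(String input) {
        return ASCII_INVISIBLE.matcher(input).replaceAll("");
    }

    static String stripUnicodeAware(String input) {
        return UNICODE_INVISIBLE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL and BEL are removed; the space, tab and newline are kept.
        String dirty = "a\u0000b\u0007 c\td\n";
        System.out.println(strip(dirty).equals("ab c\td\n")); // true

        // The ASCII-only version also drops the accented letter.
        System.out.println(strip("caf\u00e9"));             // caf
        System.out.println(stripUnicodeAware("caf\u00e9")); // café
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which str.replaceAll does internally.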

Read it line by line into a StringBuilder, and then convert it to a String:
StringBuilder sb = new StringBuilder();
// try-with-resources closes the reader even if readLine() throws
try (BufferedReader file = new BufferedReader(new FileReader(fileName))) {
    String line;
    while ((line = file.readLine()) != null) {
        // readLine() strips the line terminator, so add one back
        sb.append(line).append('\n');
    }
}
return sb.toString();

Related

Java byte array replace all occurrences of byte-array/string

Is there any "already-implemented" (not manual) way to replace all occurrences of a single byte array/string inside a byte array? I have a case where I need to create a byte array containing platform-dependent text (Linux: line feed; Windows: carriage return + line feed). I know such a task can be implemented manually, but I am looking for an out-of-the-box solution. Note that these byte arrays are large and I am processing a large number of them, so the solution needs to perform well.
My current approach:
var byteArray = resourceLoader.getResource("classpath:File.txt").getInputStream().readAllBytes();
byteArray = new String(byteArray)
.replaceAll((schemeModel.getOsType() == SystemTypes.LINUX) ? "\r\n" : "\n",
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\n" : "\r\n"
).getBytes(StandardCharsets.UTF_8);
This approach does not perform well because it creates new Strings and uses a regex to find occurrences. I know that a manual implementation would have to look at sequences of bytes because the Windows line ending is two bytes long, and would also need to reallocate the array when the length changes.
Apache Commons Lang contains ArrayUtils, which has the method
byte[] removeAllOccurrences(byte[] array, byte element). Is there any third-party library with a similar method for replacing ALL occurrences of a byte array/string inside a byte array?
Edit: As @saka1029 mentioned in the comments, my approach doesn't work for the Windows OS type. Because of this bug I need to stick with regexes, as follows:
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\\r\\n" : "(?<!\\r)\\n",
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\n" : "\r\n")
This way, for the Windows case, only occurrences of '\n' without a preceding '\r' are found and replaced with '\r\n' (the regex uses a lookbehind so that it matches only the '\n' itself; matching [^\r]\n directly would also consume the last character of the line). Such a workflow cannot be implemented using conventional replace methods, which invalidates this question.
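Since Java 8, the regex linebreak matcher \R treats \r\n as a single unit, so one pattern can cover both directions described above. A minimal sketch, assuming the whole text fits in memory as a String; LineEndings and normalize are illustrative names, not part of any library:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineEndings {
    // \R matches \r\n as one unit, plus lone \n, lone \r and the other Unicode line breaks.
    private static final Pattern ANY_LINE_BREAK = Pattern.compile("\\R");

    static String normalize(String text, String newline) {
        // quoteReplacement guards against '$' or '\' in the replacement; harmless for "\n" and "\r\n".
        return ANY_LINE_BREAK.matcher(text).replaceAll(Matcher.quoteReplacement(newline));
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\nc\r\n";
        System.out.println(normalize(mixed, "\r\n").equals("a\r\nb\r\nc\r\n")); // true
        System.out.println(normalize(mixed, "\n").equals("a\nb\nc\n"));         // true
    }
}
```

Because an existing \r\n is matched as a whole, converting to Windows line endings does not produce \r\r\n.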

If you’re reading text, you should treat it as text, not as bytes. Use a BufferedReader to read the lines one by one, and insert your own newline sequences.
String newline = schemeModel.getOsType() == SystemTypes.LINUX ? "\n" : "\r\n";
OutputStream out = /* ... */;
try (Writer writer = new BufferedWriter(
new OutputStreamWriter(out, StandardCharsets.UTF_8));
BufferedReader reader = new BufferedReader(
new InputStreamReader(
resourceLoader.getResource("classpath:File.txt").getInputStream(),
StandardCharsets.UTF_8))) {
String line;
while ((line = reader.readLine()) != null) {
writer.write(line);
writer.write(newline);
}
}
No byte array needed, and you are using only a small amount of memory—the amount needed to hold the largest line encountered. (I rarely see text with a line longer than one kilobyte, but even one megabyte would be a pretty small memory requirement.)
If you are “fixing” zip entries, the OutputStream can be a ZipOutputStream pointing to a new ZipEntry:
String newline = schemeModel.getOsType() == SystemTypes.LINUX ? "\n" : "\r\n";
ZipInputStream oldZip = /* ... */;
ZipOutputStream newZip = /* ... */;
ZipEntry entry;
while ((entry = oldZip.getNextEntry()) != null) {
newZip.putNextEntry(entry);
// We only want to fix line endings in text files.
if (!entry.getName().matches(".*\\." +
"(?i:txt|x?html?|xml|json|[ch]|cpp|cs|py|java|properties|jsp)")) {
oldZip.transferTo(newZip);
continue;
}
Writer writer = new BufferedWriter(
new OutputStreamWriter(newZip, StandardCharsets.UTF_8));
BufferedReader reader = new BufferedReader(
new InputStreamReader(oldZip, StandardCharsets.UTF_8));
String line;
while ((line = reader.readLine()) != null) {
writer.write(line);
writer.write(newline);
}
writer.flush();
}
Some notes:
Are you deliberately ignoring Macs (and other operating systems which are neither Windows nor Linux)? You should assume \n for everything except Windows. That is, schemeModel.getOsType() == SystemTypes.WINDOWS ? "\r\n" : "\n"
Your code contains new String(byteArray) which assumes the bytes of your resource use the default Charset of the system on which your program is running. I suspect this is not what you intended; I have added StandardCharsets.UTF_8 to the construction of the InputStreamReader to address this. If you really meant to read the bytes using the default Charset, you can remove that second constructor argument.
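If the data does have to stay in a byte array, a single pass over the bytes is also possible without any regex or intermediate String. The following is only a sketch, not a library API: LineEndingBytes and normalize are hypothetical names, and it assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, where the bytes 0x0D and 0x0A never occur inside a multi-byte character. It also treats a lone \r as a line ending, which may or may not be what you want.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LineEndingBytes {
    // Rewrites every line ending (\r\n, lone \n or lone \r) to the given byte sequence.
    static byte[] normalize(byte[] input, byte[] newline) {
        // Reserve a little extra room in case the output grows (\n -> \r\n).
        ByteArrayOutputStream out =
                new ByteArrayOutputStream(input.length + input.length / 16 + 16);
        for (int i = 0; i < input.length; i++) {
            byte b = input[i];
            if (b == '\r') {
                // Treat \r\n as one line ending by skipping the following \n.
                if (i + 1 < input.length && input[i + 1] == '\n') {
                    i++;
                }
                out.write(newline, 0, newline.length);
            } else if (b == '\n') {
                out.write(newline, 0, newline.length);
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] mixed = "a\r\nb\nc".getBytes(StandardCharsets.UTF_8);
        byte[] crlf = {'\r', '\n'};
        byte[] expected = "a\r\nb\r\nc".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(normalize(mixed, crlf), expected)); // true
    }
}
```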

ISO-8859-1 to UTF-8 in Java (runescape API)

I am trying to make a Discord bot which gets information from the RuneScape API and returns information about the user. The issue I have is when a username contains a space.
The RuneScape API returns a file in ISO-8859-1, and I try to convert it to UTF-8.
Two examples from the file: lil Jimmy and lil jessica.
The loop finds a match for jessica, but not for jimmy.
The code for getting and reading the file:
InputStream input = null;
InputStreamReader inputReader = null;
BufferedReader reader = null;
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
input = url.openConnection().getInputStream();
inputReader = new InputStreamReader(input, "ISO-8859-1");
reader = new BufferedReader(inputReader);
String line;
while ((line = reader.readLine()) != null) {
String[] parts = line.split(",");
parts[0] = new String(parts[0].getBytes("UTF-8"), "ISO-8859-1");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
}
Does anyone know what I'm doing wrong? Thank you in advance for taking the time to help!
Edit 1: I've added the "ISO-8859-1" to inputReader as suggested in the answers. Now the next step is to replace the non-breaking spaces with regular spaces.
Edit 2: The non-breaking space can be handled with:
parts[0] = parts[0].replaceAll("\u00a0","aaaaaaaaa");
parts[0] = parts[0].replaceAll("\u00C2","bbbbbbbbb");
parts[0] = parts[0].replaceAll("bbbbbbbbbaaaaaaaaa", " ");
The aaaaaaaaa placeholder stands in for the non-breaking space and the bbbbbbbbb placeholder for the stray Â (U+00C2) that appears in front of it; the last line replaces the pair with a single regular space.
Thanks everyone for helping me out!

If you want to ensure that you're reading the data correctly, use:
inputReader = new InputStreamReader(input, "ISO-8859-1");
After that, I'm not sure why you're trying to convert to UTF-8, since you're just using the text as Strings from that point on. A string itself doesn't have an encoding. (Well, in a certain sense a Java string is like UTF-16 in its internal representation, but that's a whole other can of worms you don't need to worry about here.)
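To make the point concrete, the sketch below reproduces the re-encoding step from the question on a String that has already been decoded correctly. RoundTripDemo and reencode are names invented for this illustration. Encoding the non-breaking space U+00A0 as UTF-8 yields the two bytes C2 A0, and reading those back as ISO-8859-1 yields two characters, Â followed by the non-breaking space, which is where the stray Â comes from.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // The same conversion as in the question: encode as UTF-8, then decode as ISO-8859-1.
    static String reencode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "lil\u00a0Jimmy"; // correctly decoded, with a non-breaking space
        String mangled = reencode(name);

        System.out.println(name.length());    // 9
        System.out.println(mangled.length()); // 10
        System.out.println(mangled.equals("lil\u00c2\u00a0Jimmy")); // true
    }
}
```

Pure ASCII names such as "lil jessica" survive the round trip unchanged, which would explain why only some names fail to match.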

First, you are not providing the charset in your InputStreamReader, which causes it to use the default charset instead of the one it should be using. Then you are doing crazy stuff to try to fix it that you shouldn't have to do and that won't work properly.
Also, you are not closing the opened stream; you should be using try-with-resources.
It should probably look more like this:
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream(), StandardCharsets.ISO_8859_1))) {
String line;
while ((line = reader.readLine()) != null) {
String[] parts = line.split(",");
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
}
}

Looking at the downloaded text file:
The whitespace for "lil jessica" is a regular space (U+0020), the one for "lil Jimmy" (and most of the others as well) is a non-breaking space (U+00A0).
If you don't care about the difference between breaking and non-breaking spaces, the easiest approach is probably to replace the non-breaking space with a regular space in your input string. Once the stream is decoded as ISO-8859-1, no further re-encoding is needed. Something like:
parts[0] = parts[0].replace('\u00a0', ' ');
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}

Scanner unable to capture last newline character if last line is empty

I am taking a programming class where we have to compress a file using a Huffman tree and then decompress it.
I am running into a problem where I am unable to capture the last newline character of a txt file.
E.G.
This is a line
This is a secondline
//empty line
So if I compress and decompress the above text in a file, I end up with a file with this
This is a line
This is a secondline
Right now I'm doing
while(file.hasNextLine()){
char[] cArr = file.nextLine().toCharArray();
//count amount of times a character appears with a hashmap
if(file.hasNextLine()){
//add an occurrence of \n to the hashmap
}
}
I understand the problem is that the last line technically does not have a "Scanner.hasNextline()" since I just consumed the last '\n' of the file with the nextLine() call.
Upon realizing that I have tried doing useDelimiter("") and Scanner.next() instead of Scanner.nextLine() and both still lead to similar problems.
So is there a way to fix this?
Thanks in advance.

Without completely changing your approach, reading the whole file into a StringBuilder seems to work well.
File testfile = new File("test.txt");
StringBuilder stringBuffer = new StringBuilder();
try{
BufferedReader reader = new BufferedReader(new FileReader(testfile));
char[] buff = new char[500];
for (int charsRead; (charsRead = reader.read(buff)) != -1; ) {
stringBuffer.append(buff, 0, charsRead);
}
}
catch(Exception e){
System.out.print(e);
}
System.out.println(stringBuffer);
Checking the total number of characters read:
System.out.println(stringBuffer.length()); // 51
File size listing:
ls -l .
51 test.txt
The character count matches the file size in bytes (the two are equal for single-byte text such as this), so it appears it got all lines, including the blank line.
Note: I use Java 6; modify to suit your version.
Hope this helps.
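For the character-counting step itself, one option is to skip Scanner and read one character at a time, so that every '\n' (including the last one) is counted like any other character. A minimal sketch, assuming Java 8 or later for Map.merge; CharFrequency and count are hypothetical names, and the IOException is wrapped in an UncheckedIOException only to keep the example short:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class CharFrequency {
    // Counts every character the reader delivers, line terminators included.
    static Map<Character, Integer> count(Reader in) {
        Map<Character, Integer> freq = new HashMap<>();
        try {
            int c;
            while ((c = in.read()) != -1) {
                freq.merge((char) c, 1, Integer::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return freq;
    }

    public static void main(String[] args) {
        // For a real file, pass a BufferedReader wrapped around a FileReader instead.
        String text = "This is a line\nThis is a secondline\n";
        Map<Character, Integer> freq = count(new StringReader(text));
        System.out.println(freq.get('\n')); // 2
    }
}
```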

Skip parts while reading and writing a file in Android/Java

I'm trying to learn Java/Android and right now I'm doing some experiments with the replaceAll function. But I've found that with large text files the process gets sluggish so I was wondering if there is a way to skip the "useless" parts of a file to have a better performance. (Note: Just skip them, not delete them)
Note: I am not trying to "count lines" or "println" or "system.out", I'm just replacing strings and saving the changes in the same file.
Example
AAAA
CCCC- 9234802394819102948102948104981209381'238901'2309'129831'2381'2381'23081'23081'284091824098304982390482304981'20841'948023984129048'1489039842039481'204891'29031'923481290381'20391'294872385710239841'20391'20931'20853029573098341'290831'20893'12894093274019799919208310293810293810293810293810298'120931¿2093¿12039¿120931¿203912¿0391¿203912¿039¿12093¿12093¿12093¿12093¿12093¿1209312¿0390¿... DDDD
AAAA
CCCC- 9234802394819102948102948104981209381'238901'2309'129831'2381'2381'23081'23081'284091824098304982390482304981'20841'948023984129048'1489039842039481'204891'29031'923481290381'20391'294872385710239841'20391'20931'20853029573098341'290831'20893'12894093274019799919208310293810293810293810293810298'120931¿2093¿12039¿120931¿203912¿0391¿203912¿039¿12093¿12093¿12093¿12093¿12093¿1209312¿0390¿... DDDD
and so on....like a zillion times
I want to replace all "AAAA" with "BBBB", but there are large portions of data between the strings I am replacing. Also, these portions always begin with "CCCC" and end with "DDDD".
Here's the code I am using to replace the string.
File file = new File("my_file.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
String line = "", oldtext = "";
while((line = reader.readLine()) != null) {
oldtext += line + "\r\n";
}
reader.close();
// Replacing "AAAA" strings
String newtext= oldtext.replaceAll("AAAA", "BBBB");
FileWriter writer = new FileWriter("my_file.txt");
writer.write(newtext);
writer.close();
I think reading all lines is inefficient, especially when you won't be modifying these parts (and they represent 90% of the file).
Does anyone know a solution???

You are wasting a lot of time on this line --
oldtext += line + "\r\n";
In Java, Strings are immutable, which means you can't modify them. Therefore, when you do the concatenation, Java actually makes a complete copy of oldtext. So, for every line in your file, you are recopying every line that came before into a new String. Take a look at StringBuilder for a way to build a String while avoiding these copies.
However, in your case, you do not need the whole file in memory, because you can process line by line. By moving your replaceAll and write into your loop, you can operate on each line as you read it. This will keep the memory footprint of the routine down, because you are only keeping a single line in memory.
Note that since the FileWriter is opened before you read the input file, you need to have a different name for the output file. If you want to keep the same name, you can do a renameTo on the File after you close it.
File file = new File("my_file.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
FileWriter writer = new FileWriter("my_out_file.txt");
String line;
while ((line = reader.readLine()) != null) {
    // Replacing "AAAA" strings
    String newtext = line.replaceAll("AAAA", "BBBB");
    writer.write(newtext);
    // readLine() strips the line terminator, so write it back
    writer.write("\r\n");
}
reader.close();
writer.close();
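To come back to the "skip" part of the question: the CCCC ... DDDD blocks still have to be read and written, but the search itself can be skipped for them. The sketch below assumes, as in the sample, that each block sits on a single line that starts with CCCC. SelectiveReplace, rewriteLine and rewriteFile are hypothetical names, and UTF-8 is an assumed encoding. It uses String.replace, which does a literal replacement, rather than the regex-based replaceAll.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SelectiveReplace {
    // Leaves the long data lines untouched and only searches the remaining lines.
    static String rewriteLine(String line) {
        if (line.startsWith("CCCC")) {
            return line;
        }
        return line.replace("AAAA", "BBBB");
    }

    // Streams the file through a temporary file, then moves it over the original.
    static void rewriteFile(Path source) throws IOException {
        Path temp = Files.createTempFile(source.toAbsolutePath().getParent(), "rewrite", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(rewriteLine(line));
                writer.write("\r\n"); // readLine() drops the terminator
            }
        }
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) {
        System.out.println(rewriteLine("AAAA"));            // BBBB
        System.out.println(rewriteLine("CCCC- AAAA DDDD")); // CCCC- AAAA DDDD
    }
}
```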

java read properties and xml file using stringbuilder

I need to read a set of XML and properties files and parse the data. Currently I am using an InputStream and a StringBuilder to do this, but the result does not match the input file. I do not want to remove the whitespace and newlines. How do I achieve this?
is = test.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
String line5;
StringBuilder sb5 = new StringBuilder();
while ((line5 = br.readLine()) != null) {
sb5.append(line5);
}
String s = sb5.toString();
My output is:
#test 123 #test2 345
Expected output is:
#test
123
#test2
345
Any thoughts ? Thanks

br.readLine() consumes the line breaks; you need to add them to your StringBuilder after appending each line.
is = test.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
String line5;
StringBuilder sb5 = new StringBuilder();
while ((line5 = br.readLine()) != null) {
sb5.append(line5);
sb5.append("\n");
}
If you want an extremely simple solution for reading a file to a String, Apache Commons-IO has a method for performing such a task (org.apache.commons.io.FileUtils).
FileUtils.readFileToString(File file, String encoding);

The readLine() method doesn't include the EOL character (\n). So while appending the string to the builder, you need to add the EOL character yourself, like sb5.append(line5+"\n");

The various readLine methods discard the newline from the input.
From the BufferedReader docs:
Returns: A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached
A solution may be as simple as adding back a newline to your StringBuilder for every readLine: sb5.append(line5 + "\n");.
A better alternative is to read into an intermediate buffer first, using the read method and supplying your own char[]. You can still use StringBuilder.append, and you get a String that will match the file contents.
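A minimal sketch of that buffer-based approach, shown against a StringReader so that it runs on its own. ExactReader and readAll are hypothetical names, and for a real file you would pass a reader created with an explicit charset. The IOException is wrapped in an UncheckedIOException purely to keep the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ExactReader {
    // Copies everything the reader delivers, so line terminators arrive unchanged.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[8192];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // Append only the n characters that were actually read.
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "#test\r\n123\n#test2\n345\n";
        String copy = readAll(new StringReader(original));
        System.out.println(copy.equals(original)); // true
    }
}
```

Unlike the readLine() loop, this keeps mixed \r\n and \n endings exactly as they were and does not add a newline to a file that lacks a trailing one.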
