I am trying to download a web page with all its resources. First I download the HTML, and to be sure the file keeps its formatting I use the function below.
There is an issue: I am finding the literal text 10 in the final file, and 10 is the character code of LF (line feed). This breaks my JavaScript functions.
Example of the final result:
<!DOCTYPE html>10<html lang="fr">10 <head>10 <meta http-equiv="content-type" content="text/html; charset=UTF-8" />10
Can someone help me find the real issue?
public static String scanfile(File file) {
    StringBuilder sb = new StringBuilder();
    try {
        BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
        while (true) {
            String readLine = bufferedReader.readLine();
            if (readLine != null) {
                sb.append(readLine);
                sb.append(System.lineSeparator());
                Log.i(TAG, sb.toString());
            } else {
                bufferedReader.close();
                return sb.toString();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
There are multiple problems with your code.
Charset error
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
This is going to fail in subtle, tricky ways.
Files (and, for that matter, data given to you by webservers) come in bytes: a stream of numbers, each number between 0 and 255.
So, if you are a webserver and you want to send the character ö, what byte(s) do you send?
The answer is complicated. The mapping that explains how some character is rendered in byte(s)-form is called a character set encoding (shortened to 'charset').
Anytime bytes are turned into characters or vice versa, there is always a charset involved. Always.
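As a quick illustration (a sketch, not from your code), the same character turns into different bytes depending on the charset:
// 'ö' is a single byte (0xF6) in ISO-8859-1, but two bytes (0xC3 0xB6) in UTF-8
byte[] latin1 = "ö".getBytes(StandardCharsets.ISO_8859_1);
byte[] utf8   = "ö".getBytes(StandardCharsets.UTF_8);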
So, you're reading a file (that'd be bytes), and turning it into a Reader (which is chars). Thus, charset is involved.
Which charset? The API of new FileReader(path) explains which one: "The system default". You do not want that.
Thus, this code is broken. You want one of two things:
Option 1 - write the data as is
When doing the job of querying the webserver for the data and relaying this information onto disk, you'd want to just store the bytes (after all, webserver gives bytes, and disks store bytes, that's easy), but the webserver also sends the encoding, in a header, and you need to save this separately. Because to read that 'sack of bytes', you need to know the charset to turn it into characters.
How would you do this? Well, that's up to you. You could, for example, decree that the data file starts with the name of a charset encoding (as sent via that header), then a 0 byte, and then the data, unmodified. I think you should go with option 2, however.
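For what it's worth, a hypothetical sketch of that decree (saveRaw and its parameters are purely illustrative, not existing code; the charset name would come from the Content-Type header):
static void saveRaw(Path target, String charsetName, byte[] rawBody) throws IOException {
    try (OutputStream out = Files.newOutputStream(target)) {
        out.write(charsetName.getBytes(StandardCharsets.US_ASCII)); // e.g. "ISO-8859-1"
        out.write(0);       // the 0 separator byte
        out.write(rawBody); // the bytes exactly as the webserver sent them
    }
}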
Option 2
Another, better option for text-based documents (which HTML is), is this: When reading the data, convert it to characters, using the encoding as that header tells you. Then, to save it to disk, turn the chars back to bytes, using UTF-8, which is a great encoding and an industry standard. That way, when reading, you just know it's UTF-8, period.
To read a UTF-8 text file, you do:
Files.newBufferedReader(file.toPath());
The reason this works is that the Files API, unlike most other APIs (and unlike FileReader, which you should never ever use), defaults to UTF_8 and not to the platform default. If you want, you can make it more readable:
Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8);
Same thing, but now it is clear in the code what's happening.
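The write side of option 2 could look roughly like this (a sketch; saveAsUtf8 and serverCharsetName are illustrative names, and Files.writeString needs Java 11+, just like the readString used further down):
static void saveAsUtf8(Path target, byte[] rawBody, String serverCharsetName) throws IOException {
    String text = new String(rawBody, Charset.forName(serverCharsetName)); // decode with the server's charset
    Files.writeString(target, text, StandardCharsets.UTF_8);               // store on disk as UTF-8
}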
Broken exception handling
} catch (IOException e) {
    e.printStackTrace();
    return null;
}
This is not okay: if you catch an exception, either [A] throw something else, or [B] actually handle the problem. 'Log it and keep going' is definitely not handling it. This style of exception handling means one error leads to a thousand things going wrong, with a thousand stack traces, all of them except the first undesired and irrelevant. That is why this is horrible code and you should never write it this way.
The easy solution is to just put throws IOException on your scanFile method. The method inherently interacts with files, so it SHOULD be throwing that. Note that your public static void main(String[] args) method can, and usually should, be declared as throws Exception.
It also makes your code simpler and shorter, yay!
Resource Management failure
A FileReader is a resource; you MUST close it, no matter what happens. You are not doing that: if .readLine() throws an exception, your code jumps to the catch handler and bufferedReader.close() is never executed.
The solution is to use the ARM (Automatic Resource Management) construct:
try (var br = Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
    // code goes here
}
This construct ensures that close() is invoked, regardless of how the 'code goes here' block exits. Even if it 'exits' via an exception or a return statement.
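Applied to your method, a sketch might look like this (note there is no catch block; per the previous section, the IOException simply propagates to the caller):
public static String scanFile(File file) throws IOException {
    StringBuilder sb = new StringBuilder();
    // try-with-resources: the reader is closed no matter how this block exits
    try (BufferedReader br = Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append('\n'); // always a plain LF, independent of the platform
        }
    }
    return sb.toString();
}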
The problem
Apart from the three items above, your 'read a file and print it' code is mostly fine. The real problem is that the HTML file on disk is corrupted; the bug lies in your code that reads the data from the web server and saves it to disk, which you did not paste.
Specifically, System.lineSeparator() returns the actual separator characters (e.g. \n or \r\n), not the number 10. Thus, assuming the code you pasted really is the code you are running, if you are seeing a literal '10' show up, then the HTML file on disk already has it in there. It is not the reading code.
Closing thoughts
More generally, the job of 'read a whole file on disk with a known encoding' can be done in far fewer lines of code:
public static String scanFile(String path) throws IOException {
    return Files.readString(Paths.get(path));
}
You should just use the above code instead. It's simple, short, doesn't have any bugs, cannot leak resources, has proper exception handling, and will use UTF-8.
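For completeness, a call site could be as simple as this (the path is just an example):
String html = scanFile("C:/pages/page.html");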
Actually, there is no problem in this function; I was mistakenly adding the 10 in another function in my code.
I am getting this error:
Caused by: java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\Users\Emre\Desktop\PN1g1z.gif
And I really don't get what's wrong.
This is what throws the exception:
Media media = new Media(file.getAbsolutePath());
Media expects a URI, as a String, in its constructor. So instead of using File#getAbsolutePath(), you should be using File#toURI().
https://docs.oracle.com/javase/7/docs/api/java/io/File.html#toURI%28%29
From the Media#new JavaDoc (thanks @Andreas):
source - The URI of the source media.
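A minimal sketch of the fix, assuming file is the java.io.File you already have:
// file.toURI() yields something like "file:/C:/Users/Emre/Desktop/PN1g1z.gif"
Media media = new Media(file.toURI().toString());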
Actually, the big problem is where you put your server.
I already faced this problem before: I used Geronimo with a space in its directory, D:\Common DevTool\Geronimo.
You have two ways to resolve it:
1. Move your server to a path without a space in the name. Your directory, C:/Program Files, is not correct because of the space; when I changed mine to D:\Tool\Geronimo it ran well.
2. Upgrade the JSF version.
I had a similar problem but with a different exception: java.lang.IllegalArgumentException: Illegal character in opaque part at index 2: C:\Users\MyUser\project\src\main\packagex\file.csv
The project was created in Java, not by me. The code tried to read a CSV file. It worked fine on Linux but crashed on Windows 10.
My solution was to change the absolute path to a relative one and switch the backslashes to 'normal' forward slashes. The URL finally worked as:
src/main/packagex/file.csv
It seems the colon and the backslashes trigger the exception: the colon makes the parser treat C as a URI scheme, and the backslash right after it (at index 2) is an illegal character in the opaque part, hence:
Illegal character in opaque part at index 2: C:\Users\MyUser...
Reference:
https://background.sysfactory.online/index.php/2022/09/23/solucion-java-lang-illegalargumentexception-illegal-character-in-opaque-part-at-index-2-cusersmyuser/
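If you do need an absolute Windows path, a sketch of a more robust alternative is to let the JDK build the URI instead of editing the string by hand (the path below is just the example from the exception):
// Produces a valid URI such as file:///C:/Users/MyUser/project/src/main/packagex/file.csv
URI uri = Paths.get("C:\\Users\\MyUser\\project\\src\\main\\packagex\\file.csv").toUri();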
I am using a class to crunch XML feeds (RSS feeds like http://www.reddit.com/r/carporn/.rss) into JSONObjects for easy processing. Normally this class works perfectly for every feed I give it. Strangely, when trying to use Reddit's feeds, which are perfectly valid XML per W3C validators, I get the error:
E/JSON exception﹕ Missing ';' in XML entity: & at character 21607
I threw the feed into Notepad++ and went to character 21607 and found:
"
This appears to be a perfectly valid encoding for XML purposes of the double quote character: ". W3C took the same input and passed 0 warnings or errors, the XML is definitely completely valid.
So, why is XML.toJSONObject failing on valid XML? I've noted it also fails when confronted with:
&amp;apos;
I can't believe some rookie like me is finding a bug, so what's really going on here?
Thank you!
Ultimately, I fixed this problem with the following hack; I'd still like to know why it is necessary:
/*
Replaces the double-quote and single-quote values below with the actual characters
*/
feedsRssResult = feedsRssResult.replaceAll("&amp;quot;", "\"");
feedsRssResult = feedsRssResult.replaceAll("&amp;apos;", "'");
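Put together, a sketch of the workaround (feedsRssResult is assumed to hold the raw RSS text, and XML/JSONObject are from org.json; keep it inside whatever JSONException handling you already have):
// Replace the entities org.json's XML parser chokes on, then convert as usual
feedsRssResult = feedsRssResult.replaceAll("&amp;quot;", "\"").replaceAll("&amp;apos;", "'");
JSONObject feedJson = XML.toJSONObject(feedsRssResult);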
In at least 5 applications I have attempted to display UTF-8 encoded characters, and every time, quite sporadically and rarely, I see random characters being replaced by diamond question marks (see image for details).
I enclose a page layout to demonstrate the issue. The layout is very basic; it is a very simple poll I am creating. The "Съгласен съм" text is taken from a database, where it was just inserted by a script using a copy-pasted constant. The text is displayed in TextViews.
Has anyone ever encountered such an issue? Please advise!
EDIT: Something I forgot to mention is that the amount and position of the weird characters varies on different Android phone models.
Finally I got it all sorted out in all my applications. The issues boiled down to 3 different reasons, and I will list all of them below so that these findings of mine can help people in the future.
Reason 1: Incorrect encoding of user created file.
This actually was the problem with the application I posted about in the question. The problem was that the encoding of the insert script I used for introducing the values into the database was "UTF8 without BOM". I converted this encoding to "UTF8" (with BOM) using Notepad++ and reinserted the values into the database, and the issue was resolved. Thanks to @user3249477 for pointing my thinking in this direction. By the way, "UTF8 without BOM" seems to be the default encoding Eclipse uses when creating UTF-8 files, so take care!
Reason 2: Incorrect encoding of generated file.
Reason 1 pointed me to what to look for in some of the other cases I was facing. In one of my applications I am provided with raw data that I insert into my backend database using a simple Java application. The problem there turned out to be that I was passing through an intermediate format: files stored on the file system, which I used to verify that I interpreted the raw data correctly. I noticed that these files were also created as "UTF8 without BOM". I used this code to write to these files:
BufferedOutputStream outputStream = new BufferedOutputStream(new FileOutputStream(outputFilePath));
writer = new BufferedWriter(new OutputStreamWriter(outputStream, STRING_ENCODING));
writer.append(string);
Which I changed to:
BufferedOutputStream outputStream = new BufferedOutputStream(new FileOutputStream(outputFilePath));
writer = new BufferedWriter(new OutputStreamWriter(outputStream, STRING_ENCODING));
// prepending a bom
writer.write('\ufeff');
writer.append(string);
Following the prescriptions from this answer, the line I added basically made all the intermediate files be encoded as "UTF8" with a BOM, and it resolved my encoding issues.
Reason 3: Incorrect parsing of HTTP responses
The last issue I encountered, in a few of my applications, was that I was not interpreting UTF-8 HTTP responses correctly. I used to have the following code:
HttpResponse response = httpClient.execute(host, request, (HttpContext) null);
String responseBody = null;
responseBody = IOHelper.getInputStreamContents(responseStream);
Where IOHelper is a utility I wrote myself that reads stream contents into a String. I replaced this code with the method already provided in the Android API:
HttpResponse response = httpClient.execute(host, request, (HttpContext) null);
String responseBody = null;
if (response.getEntity() != null) {
responseBody = EntityUtils.toString(response.getEntity(), HTTP.UTF_8);
}
And this fixed the encoding issues I was having with HTTP responses.
In conclusion, I can say that one needs to take special care with BOM / without-BOM files when using UTF-8 encoding in Android. I am very happy I learned so many new things during this investigation.
Here I have the following bit of code, taken from this Oracle Java tutorial:
// Defaults to READ
try (SeekableByteChannel sbc = Files.newByteChannel(file)) {
    ByteBuffer buf = ByteBuffer.allocate(10);

    // Read the bytes with the proper encoding for this platform. If
    // you skip this step, you might see something that looks like
    // Chinese characters when you expect Latin-style characters.
    String encoding = System.getProperty("file.encoding");
    while (sbc.read(buf) > 0) {
        buf.rewind();
        System.out.print(Charset.forName(encoding).decode(buf));
        buf.flip(); // LINE X
    }
} catch (IOException x) {
    System.out.println("caught exception: " + x);
}
So basically I do not get any output from it.
I tried putting some flags in the while loop to check whether it is entered, and it is. I also changed the encoding to Charset.defaultCharset().decode(buf); result: no output.
Of course there is text in the file passed to newByteChannel(file).
Any idea?
Thanks a lot in advance.
EDIT: Solved, it was just the file I was trying to access that had previously been accidentally corrupted. After changing the file, everything works.
The code looks wrong. Try changing the rewind() to flip(), and the flip() to compact().
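A sketch of the corrected loop, following that suggestion (the rest of the code is unchanged from the question):
try (SeekableByteChannel sbc = Files.newByteChannel(file)) {
    ByteBuffer buf = ByteBuffer.allocate(10);
    Charset charset = Charset.forName(System.getProperty("file.encoding"));
    while (sbc.read(buf) > 0) {
        buf.flip();                             // switch the buffer from writing to reading
        System.out.print(charset.decode(buf));  // decode only the bytes actually read
        buf.compact();                          // prepare the buffer for the next read, preserving any undecoded bytes
    }
} catch (IOException x) {
    System.out.println("caught exception: " + x);
}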