XML parsing error related to char encoding set - java

I have a valid XML file (valid because a browser can parse it) that I am trying to parse using JDOM2. The code ran fine for other XML files, but for this particular file it throws the following exception on the builder.build() line: "com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence."
My code is as follows:
import java.io.*;
import java.util.*;
import java.net.*;
import org.jdom2.*;
import org.jdom2.input.*;
import org.jdom2.output.*;
import org.jdom2.adapters.*;
public class Test
{
public static void main(String st[])
{
String results="N.A.";
SAXBuilder builder = new SAXBuilder();
Document doc;
results = scrapeSite().trim();
try
{
doc = builder.build(new ByteArrayInputStream(results.getBytes()));
}
catch(JDOMException e)
{
System.out.println(e.toString());
}
catch(IOException e)
{
System.out.println(e.toString());
}
}
public static String scrapeSite()
{
String temp="";
try
{
URL url = new URL("http://msu-footprints.org/2011/Aditya/search_5.xml");
URLConnection conn = url.openConnection();
conn.setAllowUserInteraction(false);
InputStream urlStream = url.openStream();
BufferedReader br = new BufferedReader(new InputStreamReader(urlStream));
String t = br.readLine();
while(t!=null)
{
temp = temp + t;
t = br.readLine();
}
}
catch(IOException e)
{
System.out.println(e.toString());
}
return temp;
}
}

Why are you reading the XML into a String with a Reader? An InputStreamReader without an explicit charset decodes with the platform default encoding, so you are corrupting the XML before you parse it. Treat XML as bytes, not chars.
And why are you reading the whole URL InputStream just to convert it into another ByteArrayInputStream? You can reduce that to about two lines of code by passing the URL InputStream directly to the builder (not to mention avoiding the extra memory used by reading the entire stream into memory).
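The advice above can be sketched as follows. This is only a minimal sketch: it uses the JDK's built-in DOM parser instead of JDOM2 so that it stays self-contained (with JDOM2, SAXBuilder.build(InputStream) would take the place of DocumentBuilder.parse), and the names XmlFromBytes, rootText and demo are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original question or answers.
final class XmlFromBytes {

    // Hands the raw byte stream to the parser, which works out the
    // encoding itself (here from the XML declaration).
    static String rootText(InputStream xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xmlBytes);
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Usage: the document is stored as ISO-8859-1 and says so in its declaration.
    static String demo() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Caf\u00e9</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        return rootText(new ByteArrayInputStream(bytes));
    }
}
```

For the question's case, the stream returned by url.openStream() could be passed to such a method directly, with no intermediate String.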

As jtahlborn points out, you should always treat XML as bytes, letting the parser work out the encoding.
But more than that, you should never use the no-argument String.getBytes() to get the bytes of a string: it encodes with the platform default charset, so you may not be getting the bytes you think you are.
In this case you can just get the bytes of the site, but even if you were constructing XML in a string and then handing that to a parser as a byte sequence (or, more likely, writing the bytes to a file), you would want to specify the encoding such that it matches the encoding the XML says it's in, which by default is UTF-8:
byte[] bytes = myString.getBytes("UTF-8");
Likewise, if for some reason you need to use a Writer or Reader, you must specify the encoding to write or read in.
If you need to construct XML, a good way is to use the XMLStreamWriter class:
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
XMLStreamWriter writer =
XMLOutputFactory.newInstance().createXMLStreamWriter(outStream);
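A fuller version of the snippet above might look like the following. It is a sketch under assumptions: the element name greeting and the class and method names are invented for the example, and the encoding is passed explicitly to both the factory call and writeStartDocument so that the bytes and the XML declaration agree.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; the element name is made up.
final class XmlToBytes {

    // Builds a tiny document and returns it as UTF-8 bytes.
    static byte[] build(String text) {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(outStream, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("greeting");
            writer.writeCharacters(text); // escapes markup characters such as '<'
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return outStream.toByteArray();
    }
}
```

The resulting byte array can be written to a file or handed to a parser as is, without ever going through String.getBytes().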

Related

Use of Utils.java file

I have this Java servlet API file and in it is a class called utils.java . I can't quite figure out what the use of this piece of code is in the API. This is my first time working on APIs so any help in understanding this would be appreciated.
package implementation;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import javax.servlet.http.HttpServletRequest;
public class Utils {
public static String getBody(HttpServletRequest request) throws IOException {
String body = null;
StringBuilder stringBuilder = new StringBuilder();
BufferedReader bufferedReader = null;
try {
InputStream inputStream = request.getInputStream();
if (inputStream != null) {
bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
char[] charBuffer = new char[128];
int bytesRead = -1;
while ((bytesRead = bufferedReader.read(charBuffer)) > 0) {
stringBuilder.append(charBuffer, 0, bytesRead);
}
} else {
stringBuilder.append("");
}
} catch (IOException ex) {
throw ex;
} finally {
if (bufferedReader != null) {
try {
bufferedReader.close();
} catch (IOException ex) {
throw ex;
}
}
}
body = stringBuilder.toString();
return body;
}
}
And then in other servlets it has been called like this: String req = Utils.getBody(request);
Can someone please explain the working?
The purpose of this method is to read the request body and return it as a String. Basically, it gets hold of the request's input stream, wraps it as a reader (which converts to characters), reads characters from it and appends them to StringBuilder. When it reaches the end of the stream it closes it, and returns the builder's contents as a String.
The code could be simplified a bit. Indeed, in Java 8+, the core code could be replaced with
return bufferedReader.lines().collect(Collectors.joining("\n"));
The clunky handling of the streams could be simplified using Java 7+ try with resources.
The method simplifies to this:
public static String getBody(HttpServletRequest request) throws IOException {
try (InputStream is = request.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
return br.lines().collect(Collectors.joining(System.lineSeparator()));
}
}
There are a few issues with this:
It is using the platform default character set to decode the input rather than the character set that may have been specified in the HTTP request header. That problem can be solved by using request.getReader() instead of request.getInputStream().
It is converting the original end-of-line sequences into the platform's standard end-of-line sequences.
If the request's body is extremely large, converting it into a String could fill up the heap, and lead to OOMEs. That could be used as a Denial of Service attack. If this is a concern, the code needs to be more defensive ... or you need to set a request size limit at the web container level.
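A more defensive variant that addresses the three points above could look roughly like this. It is a sketch, not the servlet API's own mechanism: BodyReader, readBody and the maxBytes parameter are hypothetical, and the method takes a plain InputStream and Charset so that it runs without a servlet container. In a servlet, the stream would come from request.getInputStream() and the charset would typically be derived from request.getCharacterEncoding(), which may return null.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper; names and the size limit are assumptions.
final class BodyReader {

    // Reads at most maxBytes bytes and decodes them with an explicit charset.
    // Line endings are kept as sent, because the bytes are decoded as a whole.
    static String readBody(InputStream in, Charset charset, int maxBytes) {
        try {
            // Read one byte more than allowed so an oversized body is detectable
            // (assumes maxBytes is smaller than Integer.MAX_VALUE).
            byte[] bytes = in.readNBytes(maxBytes + 1); // Java 11+
            if (bytes.length > maxBytes) {
                throw new IllegalArgumentException(
                        "request body exceeds " + maxBytes + " bytes");
            }
            return new String(bytes, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The IOException is wrapped only to keep the sketch short; in a real servlet it is probably better to let it propagate, as the original method does.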

How can I create a truststore from a base64 encoded String?

I have a String that is encoded in base64. I need to take this string, decode it and create a truststore file, but when I do that, the final file is not valid. Here is my code:
public static void buildFile() {
String exampleofencoded = "asdfasdfasdfadfa";
File file = new File("folder/file.jks");
try (FileOutputStream fos = new FileOutputStream(file);
BufferedOutputStream bos = new BufferedOutputStream(fos);
DataOutputStream dos = new DataOutputStream(bos))
{
Base64.Decoder decoder = Base64.getDecoder();
String decodedString =new String(decoder.decode(exampleofencoded).getBytes());
dos.writeBytes(decodedString);
}
catch (IOException e) {
System.out.println("Error creating file");
}
catch(NullPointerException e) {
System.out.println(e.getMessage());
}
}
The problem is two-fold.
You're converting a byte[] array to String, which is a lossy operation for actual binary data for most character sets (except maybe iso-8859-1).
You're using DataOutputStream, which is not a generic output stream, but intended for a specific serialization format of primitive types. And specifically its writeBytes method comes with an important caveat ("Each character in the string is written out, in sequence, by discarding its high eight bits."), which is one more reason why only using iso-8859-1 will likely work.
Instead, write the byte array directly to the file
public static void buildFile() {
String exampleofencoded = "asdfasdfasdfadfa";
File file = new File("folder/file.jks");
try (OutputStream fos = Files.newOutputStream(file.toPath())) {
Base64.Decoder decoder = Base64.getDecoder();
byte[] decodedbytes = decoder.decode(exampleofencoded);
fos.write(decodedbytes);
} catch (IOException e) {
System.out.println("Error creating file");
}
}
As an aside, you shouldn't catch NullPointerException in your code; it is almost always a problem that can be prevented by careful programming and/or validation of inputs. I would usually also advise against catching the IOException here and only printing a message. It is probably better to propagate that exception as well, and let the caller handle it.
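To illustrate why the bytes must never pass through a String, here is a small sketch that decodes Base64 and writes the bytes unchanged, then reads them back for comparison. The class and method names are hypothetical, and the sample bytes are arbitrary values chosen because they are not valid text (they are not a real keystore).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical example class, not part of the original answer.
final class Base64ToFile {

    // Decodes the Base64 text and writes the resulting bytes unchanged.
    static void write(String base64, Path target) throws IOException {
        byte[] decoded = Base64.getDecoder().decode(base64);
        Files.write(target, decoded);
    }

    // Usage: round-trips binary sample data through a temporary file.
    static boolean demo() {
        byte[] original = {(byte) 0xFE, (byte) 0xED, (byte) 0xFE, (byte) 0xED, 0x00, (byte) 0x80};
        String base64 = Base64.getEncoder().encodeToString(original);
        try {
            Path tmp = Files.createTempFile("truststore", ".jks");
            try {
                write(base64, tmp);
                return Arrays.equals(original, Files.readAllBytes(tmp));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Here write propagates the IOException to its caller, in line with the aside above; only the demo method wraps it.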

Problem in reading text from the file using FileInputStream in Java

I have a file input.txt on my system and I want to read data from that file using FileInputStream in Java. There is no error in the code, but it still does not work: it does not display the output. Here is the code; could anyone kindly help me out?
package com.company;
import java.io.FileInputStream;
import java.io.InputStream;
public class Main {
public static void main(String[] args) {
// write your code here
byte[] array = new byte[100];
try {
InputStream input = new FileInputStream("input.txt");
System.out.println("Available bytes in the file: " + input.available());
// Read byte from the input stream
input.read(array);
System.out.println("Data read from the file: ");
// Convert byte array into string
String data = new String(array);
System.out.println(data);
// Close the input stream
input.close();
} catch (Exception e) {
e.getStackTrace();
}
}
}
Use utility class Files.
Path path = Paths.get("input.txt");
try {
String data = Files.readString(path, Charset.defaultCharset());
System.out.println(data);
} catch (Exception e) {
e.printStackTrace();
}
For binary data, non-text, one should use Files.readAllBytes.
available() is not the file length; it is only an estimate of the number of bytes that can be read without blocking. Reading more may block while the data is physically read from the disk device.
String.getBytes(Charset) and new String(byte[], Charset) explicitly specify the charset of the actual bytes. The String itself always holds the text as Unicode, so it can combine all scripts of the world.
Java was designed with text as Unicode, in reaction to the situation at the time with C and C++. So in a String you can mix Arabic, Greek, Chinese and math symbols. For that to work, binary data (byte[], InputStream, OutputStream) must be accompanied by the encoding (Charset) the bytes are in, and a conversion to Unicode then happens for text (String, char, Reader, Writer).
FileInputStream.read(byte[]) returns the number of bytes actually read, and that result must be used. A single call fills at most one buffer, so the call must be repeated until it returns -1.
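The last two points can be combined into a read loop such as the one below. It is a sketch with invented names (StreamText, readAll, bufferSize); it takes any InputStream, so a FileInputStream for input.txt would work the same way as the in-memory stream used for testing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper illustrating the read loop.
final class StreamText {

    // Reads the whole stream and decodes it with the given charset.
    // bufferSize must be positive.
    static String readAll(InputStream in, Charset charset, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        try {
            int n;
            // read() may fill only part of the buffer; n says how much is valid.
            while ((n = in.read(buffer)) != -1) {
                collected.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Decode once at the end so a multi-byte character is never split
        // across two buffers.
        return new String(collected.toByteArray(), charset);
    }
}
```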

read any file efficiently in java as string

I'm working on a simple implementation of Huffman coding. It works fine for any file using some form of text encoding, but when I try to read any other format (e.g. .mp4, .png, .exe) it still works but becomes extremely slow
(minutes instead of less than a second for the same size of file).
My question is: is there another method I should be using to read these files, so that the read speed depends on the size of the file and not on its format, and if so, what is it? Thanks.
This is my IO class. It uses a FileReader wrapped in a BufferedReader to read files based on a path entered in the console.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
public class IO {
public String readFile(String path, boolean includeNewLine) {
String returnString = "";
try {
FileReader fileReader = new FileReader(path);
BufferedReader bufferedReader = new BufferedReader(fileReader);
String line;
int nLines = 0;
while((line = bufferedReader.readLine()) != null) {
if(nLines > 0 && includeNewLine) {
returnString += "\n";
}
returnString += line;
nLines++;
}
bufferedReader.close();
} catch(FileNotFoundException e) {
System.out.println("Unable to open file '" + path + "'");
} catch(IOException e) {
System.out.println("Error reading file '" + path + "'");
}
return returnString;
}
}
Maybe this will help: FileInputStream vs FileReader
And, of course, change your method to use StringBuilder (but that's another issue).
With returnString += line you create a new String instance on every iteration, copying the previous content and appending the new line. Instead, I would suggest you use StringBuilder as follows:
StringBuilder fileContent = new StringBuilder();
//do your stuff
fileContent.append(line);
This way, you keep reusing the same builder object. Also, if you are reading binary content, it is better to use a class from the InputStream hierarchy.
We also have the Files class from the nio package, which you could use to get the lines instead:
try (Stream<String> stream = Files.lines( Paths.get(filePath), StandardCharsets.UTF_8)) {
stream.forEach(s -> fileContent.append(s).append("\n"));
}
Another way would be to use the already tested code provided by the Apache Commons IO API: FileUtils.readFileToString.
As long as you are trying to interpret the file as a String, you'll be running into problems with efficiency. Any binary format may produce a huge string, as there may never be a byte that is interpreted as an end-of-line character ('\n'), and decoding arbitrary bytes as characters can also alter the data.
You should interpret your file as a sequence of bytes. Use a memory mapped ByteBuffer for maximum efficiency.
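For a Huffman coder, the first pass usually only needs byte frequencies, which can be computed over a memory-mapped buffer without creating any String. The following is a sketch under that assumption; ByteFrequencies, count and demo are made-up names, and the demo writes a small temporary file just to have something to map.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example class, not taken from the question.
final class ByteFrequencies {

    // Counts how often each byte value (0..255) occurs in the file.
    static long[] count(Path path) throws IOException {
        long[] freq = new long[256];
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes;
            // larger files would need several mappings.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            while (buffer.hasRemaining()) {
                freq[buffer.get() & 0xFF]++; // mask to get an unsigned index
            }
        }
        return freq;
    }

    // Usage: counts the bytes of a small binary sample file.
    static long[] demo() {
        byte[] sample = {0, (byte) 0xFF, 10, 10, (byte) 0xFF, 10};
        try {
            Path tmp = Files.createTempFile("huffman", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, sample);
            return count(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this approach the running time depends on the file size only, since no byte is ever treated as a character or a line break.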

File encoding : saved content is different than when read

I have a slight problem trying to save a file in java.
For some reason the content I get after saving my file is different from what I have when I read it.
I guess this is related to file encoding, but without being sure.
Here is test code I put together. The idea is basically to read a file, and save it again.
When I open both files, they are different.
package workspaceFun;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.commons.codec.DecoderException;
public class FileSaveTest {
public static void main(String[] args) throws IOException, DecoderException{
String location = "test.location";
File locationFile = new File(location);
FileInputStream fis = new FileInputStream(locationFile);
InputStreamReader r = new InputStreamReader(fis, Charset.forName("UTF-8"));
System.out.println(r.getEncoding());
StringBuilder builder = new StringBuilder();
int ch;
while((ch = fis.read()) != -1){
builder.append((char)ch);
}
String fullLocationString = builder.toString();
//Now we want to save back
FileOutputStream fos = new FileOutputStream("C:/Users/me/Desktop/test");
byte[] b = fullLocationString.getBytes();
fos.write(b);
fos.close();
r.close();
}
}
An extract from the input file (opened as plain text using Sublime 2):
40b1 8b81 23bc 0014 1a25 96e7 a393 be1e
and from the output file :
40c2 b1c2 8bc2 8123 c2bc 0014 1a25 c296
The getEncoding method returns "UTF8". Trying to save the output file using the same charset does not seem to solve the issue.
What puzzles me is that when I try to read the input file using Hex from apache.commons.codec like this :
String hexLocationString2 = Hex.encodeHexString(fullLocationString.getBytes("UTF-8"));
The String already looks like my output file, not the input.
Would you have any idea on what can go wrong?
Thanks
Extra info for those being interested, I am trying to read an eclipse .location file.
EDIT: I placed the file online so that you can test the code
I believe the problem is the way you are reading the stream.
You are using the FileInputStream directly to read the content instead of going through the InputStreamReader that wraps it.
By using the InputStreamReader you can specify which Charset to use.
Take into consideration that the Charset passed to the InputStreamReader must match the actual encoding of the file: the reader doesn't detect charsets, it just decodes the bytes in the charset it was given.
Try the following changes:
InputStreamReader r = new InputStreamReader(new FileInputStream(locationFile), StandardCharsets.UTF_8);
Then, instead of fis.read(), use r.read().
Finally, when writing the String, get the bytes in the same Charset as your Reader:
FileOutputStream fos = new FileOutputStream("C:/Users/me/Desktop/test");
fos.write(fullLocationString.getBytes(StandardCharsets.UTF_8));
fos.close();
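The hex dumps in the question are consistent with each byte being cast to a char and the resulting String then being encoded as UTF-8. The sketch below reproduces that effect with the first four bytes of the input extract; the class and method names are invented, and it assumes the asker's default charset was UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class reproducing the effect seen in the question.
final class EncodingRoundTrip {

    // First four bytes of the input extract: 40 b1 8b 81.
    static final byte[] INPUT = {0x40, (byte) 0xB1, (byte) 0x8B, (byte) 0x81};

    // Mimics the question's loop: every byte read becomes one char.
    static String bytesAsChars(byte[] input) {
        StringBuilder sb = new StringBuilder();
        for (byte b : input) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // Formats bytes as lowercase hex without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Encoding as UTF-8 turns every char above 0x7F into two bytes.
    static String savedAsUtf8() {
        return hex(bytesAsChars(INPUT).getBytes(StandardCharsets.UTF_8));
    }

    // Encoding as ISO-8859-1 maps each char back to the original byte.
    static String savedAsLatin1() {
        return hex(bytesAsChars(INPUT).getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Here savedAsUtf8() yields 40c2b1c28bc281, which matches the start of the output file, while savedAsLatin1() yields the original 40b18b81.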
Try to read and write back as below:
public class FileSaveTest {
public static void main(String[] args) throws IOException {
String location = "D:\\test.txt";
BufferedReader br = new BufferedReader(new FileReader(location));
StringBuilder sb = new StringBuilder();
try {
String line = br.readLine();
while (line != null) {
sb.append(line);
line = br.readLine();
if (line != null)
sb.append(System.lineSeparator());
}
} finally {
br.close();
}
FileOutputStream fos = new FileOutputStream("D:\\text_created.txt");
byte[] b = sb.toString().getBytes();
fos.write(b);
fos.close();
}
}
The test file contains both Cyrillic and Latin characters:
SDFASDF
XXFsd1
12312
іва
