SAXParseException question (how to strip any BOM characters from an xml file) - java

I have some data in an xml file and I am using the Process library to parse thru that file. I ran into the BOM marker issue, that caused some errors to be thrown. I found a work around elsewhere, which is very slow: I'm using Apache Commons BOMInputStream to read the file as a bunch of bytes, after skipping the ones that represent that BOM data.
I think that the source of my problem is actually my lack of knowledge about streams, readers and writers. There are so many different readers and writers and all kinds of "streams" (a word I barely understand) that I want to pull my hair out trying to figure out which one to use and how. I think I just picked the wrong implementation.
Question:
Can someone show me why my code is so slow, and also help me improve my understanding of file i/o?
Code:
private static XML noBOM(String filename, PApplet p) throws FileNotFoundException, IOException{
ByteArrayOutputStream out = new ByteArrayOutputStream();
File f = new File(filename);
InputStream stream = new FileInputStream(f);
BOMInputStream bomIn = new BOMInputStream(stream);
int tmp = -1;
while ((tmp = bomIn.read()) != -1){
out.write(tmp);
}
String strXml = out.toString();
return p.parseXML(strXml);
}
public static Map<String, Float> lifeExpectancyFromXML(String filename, PApplet p,
int year) throws FileNotFoundException, IOException{
Map<String, Float> dataMap = new HashMap<>();
XML xml = noBOM(filename, p);
if(xml != null){
XML[] records = xml.getChild("data").getChildren("record");
for (XML record : records){
XML[] fields = record.getChildren("field");
String country = fields[0].getContent();
int entryYear = fields[2].getIntContent();
float lifeEx = fields[3].getFloatContent();
if (entryYear == year){
System.out.println("Country: " + country);
System.out.println("Life Expectency: " + lifeEx);
dataMap.put(country, lifeEx);
}
}
}
else {
System.out.println("String could not be parsed.");
}
return dataMap;
}

Problem is probably, that InputStream is read byte by byte. Try to use buffer to make it more performant:
try (BOMInputStream bis = new BOMInputStream(new FileInputStream(new File(filename)))) {
byte[] buffer = new byte[1000];
while (bis.read(buffer) != -1) {
out.write(buffer);
}
}
Updated:
Resulting ByteArrayOutputStream may contain some empty bytes in the end. To remove them trim the resulting string:
out.toString("UTF-8").trim()

My solution was to use BufferedReader instead of creating my own buffer. It made everything quite speedy:
private static XML noBOM(String path, PApplet p) throws
FileNotFoundException, UnsupportedEncodingException, IOException{
//set default encoding
String defaultEncoding = "UTF-8";
//create BOMInputStream to get rid of any Byte Order Mark
BOMInputStream bomIn = new BOMInputStream(new FileInputStream(path));
//If BOM is present, determine encoding. If not, use UTF-8
ByteOrderMark bom = bomIn.getBOM();
String charSet = bom == null ? defaultEncoding : bom.getCharsetName();
//get buffered reader for speed
InputStreamReader reader = new InputStreamReader(bomIn, charSet);
BufferedReader breader = new BufferedReader(reader);
//Build string to parse into XML using Processing's PApplet.parsXML
StringBuilder buildXML = new StringBuilder();
int c;
while((c = breader.read()) != -1){
buildXML.append((char) c);
}
reader.close();
return p.parseXML(buildXML.toString());
}

Related

Unable to decompress gzipped files after uploading input stream chunks in S3

I'd like to take my input stream and upload gzipped parts to s3 in a similar fashion to the multipart uploader.
However, I want to store the individual file parts in S3 and not turn the parts into a single file.
To do so, I have created the following methods.
But, when I try to gzip decompress each part gzip throws an error and says: gzip: file_part_2.log.gz: not in gzip format.
I'm not sure if I am compressing each part correctly?
If I re-initialise the gzipoutputstream: gzip = new GZIPOutputStream(baos); and set gzip.finish() after reseting the byte array output stream baos.reset(); then I am able to decompress each part. Not sure why I need todo this, is there a similar reset for the gzipoutputstream?
public void upload(String bucket, String key, InputStream is, int partSize) throws Exception
{
String row;
BufferedReader br = new BufferedReader(new InputStreamReader(is, ENCODING));
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream gzip = new GZIPOutputStream(baos);
int partCounter = 0;
int lineCounter = 0;
while ((row = br.readLine()) != null) {
if (baos.size() >= partSize) {
partCounter = this.uploadChunk(bucket, key, baos, partCounter);
baos.reset();
}else if(!row.equals("")){
row += '\n';
gzip.write(row.getBytes(ENCODING));
lineCounter++;
}
}
gzip.finish();
br.close();
baos.close();
if(lineCounter == 0){
throw new Exception("Aborting upload, file contents is empty!");
}
//Final chunk
if (baos.size() > 0) {
this.uploadChunk(bucket, key, baos, partCounter);
}
}
private int uploadChunk(String bucket, String key, ByteArrayOutputStream baos, int partCounter)
{
ObjectMetadata metaData = new ObjectMetadata();
metaData.setContentLength(baos.size());
String[] path = key.split("/");
String[] filename = path[path.length-1].split("\\.");
filename[0] = filename[0]+"_part_"+partCounter;
path[path.length-1] = String.join(".", filename);
amazonS3.putObject(
bucket,
String.join("/", path),
new ByteArrayInputStream(baos.toByteArray()),
metaData
);
log.info("Upload chunk {}, size: {}", partCounter, baos.size());
return partCounter+1;
}
The problem is that you're using a single GZipOutputStream for all chunks. So you're actually writing pieces of a GZipped file, which would have to be recombined to be useful.
Making the minimal change to your existing code:
if (baos.size() >= partSize) {
gzip.close();
partCounter = this.uploadChunk(bucket, key, baos, partCounter);
baos = baos = new ByteArrayOutputStream();
gzip = new GZIPOutputStream(baos);
}
You need to do the same at the end of the loop. Also, you shouldn't throw an exception if the line counter is 0: it's entirely possible that the file is exactly divisible into a set number of chunks.
To improve the code, I would wrap the GZIPOutputStream in an OutputStreamWriter and a BufferedWriter, so that you don't need to do the string-bytes conversion explicitly.
And lastly, don't use ByteArrayOutputStream.reset(). It doesn't save you anything over just creating a new stream, and opens the door for errors if you ever forget to reset.

InputStream returns unexpected -1/empty

I seem to be hitting a constant unexpected end of my file. My file contains first a couple of strings, then byte data.
The file contains a few separated strings, which my code reads correctly.
However when I begin to read the bytes, it returns nothing. I am pretty sure it has to do with me using the Readers. Does the BufferedReader read the entire stream? If so, how can I solve this?
I have checked the file, and it does contain plenty of data after the strings.
InputStreamReader is = new InputStreamReader(in);
BufferedReader br = new BufferedReader(is);
String line;
{
line = br.readLine();
String split[] = line.split(" ");
if (!split[0].equals("#binvox")) {
ErrorHandler.log("Not a binvox file");
return false;
}
ErrorHandler.log("Binvox version: " + split[1]);
}
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int nRead, cnt = 0;
byte[] data = new byte[16384];
while ((nRead = in.read(data, 0, data.length)) != -1) {
buffer.write(data, 0, nRead);
cnt += nRead;
}
buffer.flush();
// cnt is always 0
The binvox format is as followed:
#binvox 1
dim 64 40 32
translate -3 0 -2
scale 6.434
data
[byte data]
I'm basically trying to convert the following C code to Java:
http://www.cs.princeton.edu/~min/binvox/read_binvox.html
For reading the whole String you should do this:
ArrayList<String> lines = new ArrayList<String>();
while ((line = br.readLine();) != null) {
lines.add(line);
}
and then you may do a cycle to split each line, or just do what you have to do during the cycle.
As icza has alraedy wrote, you can't create a InputStream and a BufferedReader and user both. The BufferedReader will read from the InputStream as many as he wants, and then you can't access your data from the InputStream.
You have several ways to fix it:
Don't use any Reader. Read the bytes yourself from an InputStream and call new String(bytes) on it.
Store your data encoded (e.g. Base64). Encoded data can be read from a Reader. I would recommend this solution. That'll look like that:
public byte[] readBytes (Reader in) throws IOException
{
String base64 = in.readLine(); // Note that a Base64-representation never contains \n
byte[] data = Base64.getDecoder().decode(base64);
return data
}
You can't wrap an InputStream in a BufferedReader and use both.
As its name hints, BufferedReader might read ahead and buffer data from the underlying InputStream which then will not be available when reading from the underlying InputStream directly.
Suggested solution is not to mix text and binary data in one file. They should be stored in 2 separate files and then they can be read separately. If the remaining data is not binary, then you should not read them via InputStream but via your wrapper BufferedReader just as you read the first lines.
I recommend to create a BinvoxDetectorStream that pre-reads some bytes
public class BinvoxDetectorStream extends InputStream {
private InputStream orig;
private byte[] buffer = new byte[4096];
private int buflen;
private int bufpos = 0;
public BinvoxDetectorStream(InputStream in) {
this.orig = new BufferedInputStream(in);
this.buflen = orig.read(this.buffer, 0, this.buffer.length);
}
public BinvoxInfo getBinvoxVersion() {
// creating a reader for the buffered bytes, to read a line, and compare the header
ByteArrayInputStream bais = new ByteArrayInputStream(buffer);
BufferedReader rdr = new BufferedReader(new InputStreamReader(bais)));
String line = rdr.readLine();
String split[] = line.split(" ");
if (split[0].equals("#binvox")) {
BinvoxInfo info = new BinvoxInfo();
info.version = split[1];
split = rdr.readLine().split(" ");
[... parse all properties ...]
// seek for "data\r\n" in the buffered data
while(!(bufpos>=6 &&
buffer[bufpos-6] == 'd' &&
buffer[bufpos-5] == 'a' &&
buffer[bufpos-4] == 't' &&
buffer[bufpos-3] == 'a' &&
buffer[bufpos-2] == '\r' &&
buffer[bufpos-1] == '\n') ) {
bufpos++;
}
return info;
}
return null;
}
#Override
public int read() throws IOException {
if(bufpos < buflen) {
return buffer[bufpos++];
}
return orig.read();
}
}
Then, you can detect the Binvox version without touching the original stream:
BinvoxDetectorStream bds = new BinvoxDetectorStream(in);
BinvoxInfo info = bds.getBinvoxInfo();
if (info == null) {
return false;
}
...
[moving bytes in the usual way, but using bds!!! ]
This way we preserve the original bytes in bds, so we'll be able to copy it later.
I saw someone else's code that solved exactly this.
He/she used DataInputStream, which can do a readLine (although deprecated) and readByte.

UTF-8 byte[] to String

Let's suppose I have just used a BufferedInputStream to read the bytes of a UTF-8 encoded text file into a byte array. I know that I can use the following routine to convert the bytes to a string, but is there a more efficient/smarter way of doing this than just iterating through the bytes and converting each one?
public String openFileToString(byte[] _bytes)
{
String file_string = "";
for(int i = 0; i < _bytes.length; i++)
{
file_string += (char)_bytes[i];
}
return file_string;
}
Look at the constructor for String
String str = new String(bytes, StandardCharsets.UTF_8);
And if you're feeling lazy, you can use the Apache Commons IO library to convert the InputStream to a String directly:
String str = IOUtils.toString(inputStream, StandardCharsets.UTF_8);
Java String class has a built-in-constructor for converting byte array to string.
byte[] byteArray = new byte[] {87, 79, 87, 46, 46, 46};
String value = new String(byteArray, "UTF-8");
To convert utf-8 data, you can't assume a 1-1 correspondence between bytes and characters.
Try this:
String file_string = new String(bytes, "UTF-8");
(Bah. I see I'm way to slow in hitting the Post Your Answer button.)
To read an entire file as a String, do something like this:
public String openFileToString(String fileName) throws IOException
{
InputStream is = new BufferedInputStream(new FileInputStream(fileName));
try {
InputStreamReader rdr = new InputStreamReader(is, "UTF-8");
StringBuilder contents = new StringBuilder();
char[] buff = new char[4096];
int len = rdr.read(buff);
while (len >= 0) {
contents.append(buff, 0, len);
}
return buff.toString();
} finally {
try {
is.close();
} catch (Exception e) {
// log error in closing the file
}
}
}
You can use the String(byte[] bytes) constructor for that. See this link for details.
EDIT You also have to consider your plateform's default charset as per the java doc:
Constructs a new String by decoding the specified array of bytes using
the platform's default charset. The length of the new String is a
function of the charset, and hence may not be equal to the length of
the byte array. The behavior of this constructor when the given bytes
are not valid in the default charset is unspecified. The
CharsetDecoder class should be used when more control over the
decoding process is required.
You could use the methods described in this question (especially since you start off with an InputStream): Read/convert an InputStream to a String
In particular, if you don't want to rely on external libraries, you can try this answer, which reads the InputStream via an InputStreamReader into a char[] buffer and appends it into a StringBuilder.
Knowing that you are dealing with a UTF-8 byte array, you'll definitely want to use the String constructor that accepts a charset name. Otherwise you may leave yourself open to some charset encoding based security vulnerabilities. Note that it throws UnsupportedEncodingException which you'll have to handle. Something like this:
public String openFileToString(String fileName) {
String file_string;
try {
file_string = new String(_bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
// this should never happen because "UTF-8" is hard-coded.
throw new IllegalStateException(e);
}
return file_string;
}
Here's a simplified function that will read in bytes and create a string. It assumes you probably already know what encoding the file is in (and otherwise defaults).
static final int BUFF_SIZE = 2048;
static final String DEFAULT_ENCODING = "utf-8";
public static String readFileToString(String filePath, String encoding) throws IOException {
if (encoding == null || encoding.length() == 0)
encoding = DEFAULT_ENCODING;
StringBuffer content = new StringBuffer();
FileInputStream fis = new FileInputStream(new File(filePath));
byte[] buffer = new byte[BUFF_SIZE];
int bytesRead = 0;
while ((bytesRead = fis.read(buffer)) != -1)
content.append(new String(buffer, 0, bytesRead, encoding));
fis.close();
return content.toString();
}
String has a constructor that takes byte[] and charsetname as parameters :)
This also involves iterating, but this is much better than concatenating strings as they are very very costly.
public String openFileToString(String fileName)
{
StringBuilder s = new StringBuilder(_bytes.length);
for(int i = 0; i < _bytes.length; i++)
{
s.append((char)_bytes[i]);
}
return s.toString();
}
Why not get what you are looking for from the get go and read a string from the file instead of an array of bytes? Something like:
BufferedReader in = new BufferedReader(new InputStreamReader( new FileInputStream( "foo.txt"), Charset.forName( "UTF-8"));
then readLine from in until it's done.
I use this way
String strIn = new String(_bytes, 0, numBytes);

How can I read a .txt file into a single Java string while maintaining line breaks?

Virtually every code example out there reads a TXT file line-by-line and stores it in a String array. I do not want line-by-line processing because I think it's an unnecessary waste of resources for my requirements: All I want to do is quickly and efficiently dump the .txt contents into a single String. The method below does the job, however with one drawback:
private static String readFileAsString(String filePath) throws java.io.IOException{
byte[] buffer = new byte[(int) new File(filePath).length()];
BufferedInputStream f = null;
try {
f = new BufferedInputStream(new FileInputStream(filePath));
f.read(buffer);
if (f != null) try { f.close(); } catch (IOException ignored) { }
} catch (IOException ignored) { System.out.println("File not found or invalid path.");}
return new String(buffer);
}
... the drawback is that the line breaks are converted into long spaces e.g. " ".
I want the line breaks to be converted from \n or \r to <br> (HTML tag) instead.
Thank you in advance.
What about using a Scanner and adding the linefeeds yourself:
sc = new java.util.Scanner ("sample.txt")
while (sc.hasNext ()) {
buf.append (sc.nextLine ());
buf.append ("<br />");
}
I don't see where you get your long spaces from.
You can read directly into the buffer and then create a String from the buffer:
File f = new File(filePath);
FileInputStream fin = new FileInputStream(f);
byte[] buffer = new byte[(int) f.length()];
new DataInputStream(fin).readFully(buffer);
fin.close();
String s = new String(buffer, "UTF-8");
You could add this code:
return new String(buffer).replaceAll("(\r\n|\r|\n|\n\r)", "<br>");
Is this what you are looking for?
The code will read the file contents as they appear in the file - including line breaks.
If you want to change the breaks into something else like displaying in html etc, you will either need to post process it or do it by reading the file line by line. Since you do not want the latter, you can replace your return by following which should do the conversion -
return (new String(buffer)).replaceAll("\r[\n]?", "<br>");
StringBuilder sb = new StringBuilder();
try {
InputStream is = getAssets().open("myfile.txt");
byte[] bytes = new byte[1024];
int numRead = 0;
try {
while((numRead = is.read(bytes)) != -1)
sb.append(new String(bytes, 0, numRead));
}
catch(IOException e) {
}
is.close();
}
catch(IOException e) {
}
your resulting String: String result = sb.toString();
then replace whatever you want in this result.
I agree with the general approach by #Sanket Patel, but using Commons I/O you would likely want File Utils.
So your code word look like:
String myString = FileUtils.readFileToString(new File(filePath));
There is also another version to specify an alternate character encoding.
You should try org.apache.commons.io.IOUtils.toString(InputStream is) to get file content as String. There you can pass InputStream object which you will get from
getAssets().open("xml2json.txt") *<<- belongs to Android, which returns InputStream*
in your Activity. To get String use this :
String xml = IOUtils.toString((getAssets().open("xml2json.txt")));
So,
String xml = IOUtils.toString(*pass_your_InputStream_object_here*);

How to read a file into string in java?

I have read a file into a String. The file contains various names, one name per line. Now the problem is that I want those names in a String array.
For that I have written the following code:
String [] names = fileString.split("\n"); // fileString is the string representation of the file
But I am not getting the desired results and the array obtained after splitting the string is of length 1. It means that the "fileString" doesn't have "\n" character but the file has this "\n" character.
So How to get around this problem?
What about using Apache Commons (Commons IO and Commons Lang)?
String[] lines = StringUtils.split(FileUtils.readFileToString(new File("...")), '\n');
The problem is not with how you're splitting the string; that bit is correct.
You have to review how you are reading the file to the string. You need something like this:
private String readFileAsString(String filePath) throws IOException {
StringBuffer fileData = new StringBuffer();
BufferedReader reader = new BufferedReader(
new FileReader(filePath));
char[] buf = new char[1024];
int numRead=0;
while((numRead=reader.read(buf)) != -1){
String readData = String.valueOf(buf, 0, numRead);
fileData.append(readData);
}
reader.close();
return fileData.toString();
}
Particularly i love this one using the java.nio.file package also described here.
You can optionally include the Charset as a second argument in the String constructor.
String content = new String(Files.readAllBytes(Paths.get("/path/to/file")));
Cool huhhh!
As suggested by Garrett Rowe and Stan James you can use java.util.Scanner:
try (Scanner s = new Scanner(file).useDelimiter("\\Z")) {
String contents = s.next();
}
or
try (Scanner s = new Scanner(file).useDelimiter("\\n")) {
while(s.hasNext()) {
String line = s.next();
}
}
This code does not have external dependencies.
WARNING: you should specify the charset encoding as the second parameter of the Scanner's constructor. In this example I am using the platform's default, but this is most certainly wrong.
Here is an example of how to use java.util.Scanner with correct resource and error handling:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.Iterator;
class TestScanner {
public static void main(String[] args)
throws FileNotFoundException {
File file = new File(args[0]);
System.out.println(getFileContents(file));
processFileLines(file, new LineProcessor() {
#Override
public void process(int lineNumber, String lineContents) {
System.out.println(lineNumber + ": " + lineContents);
}
});
}
static String getFileContents(File file)
throws FileNotFoundException {
try (Scanner s = new Scanner(file).useDelimiter("\\Z")) {
return s.next();
}
}
static void processFileLines(File file, LineProcessor lineProcessor)
throws FileNotFoundException {
try (Scanner s = new Scanner(file).useDelimiter("\\n")) {
for (int lineNumber = 1; s.hasNext(); ++lineNumber) {
lineProcessor.process(lineNumber, s.next());
}
}
}
static interface LineProcessor {
void process(int lineNumber, String lineContents);
}
}
You could read your file into a List instead of a String and then convert to an array:
//Setup a BufferedReader here
List<String> list = new ArrayList<String>();
String line = reader.readLine();
while (line != null) {
list.add(line);
line = reader.readLine();
}
String[] arr = list.toArray(new String[0]);
There is no built-in method in Java which can read an entire file. So you have the following options:
Use a non-standard library method, such as Apache Commons, see the code example in romaintaz's answer.
Loop around some read method (e.g. FileInputStream.read, which reads bytes, or FileReader.read, which reads chars; both read to a preallocated array). Both classes use system calls, so you'll have to speed them up with bufering (BufferedInputStream or BufferedReader) if you are reading just a small amount of data (say, less than 4096 bytes) at a time.
Loop around BufferedReader.readLine. There has a fundamental problem that it discards the information whether there was a '\n' at the end of the file -- so e.g. it is unable to distinguish an empty file from a file containing just a newline.
I'd use this code:
// charsetName can be null to use the default charset.
public static String readFileAsString(String fileName, String charsetName)
throws java.io.IOException {
java.io.InputStream is = new java.io.FileInputStream(fileName);
try {
final int bufsize = 4096;
int available = is.available();
byte[] data = new byte[available < bufsize ? bufsize : available];
int used = 0;
while (true) {
if (data.length - used < bufsize) {
byte[] newData = new byte[data.length << 1];
System.arraycopy(data, 0, newData, 0, used);
data = newData;
}
int got = is.read(data, used, data.length - used);
if (got <= 0) break;
used += got;
}
return charsetName != null ? new String(data, 0, used, charsetName)
: new String(data, 0, used);
} finally {
is.close();
}
}
The code above has the following advantages:
It's correct: it reads the whole file, not discarding any byte.
It lets you specify the character set (encoding) the file uses.
It's fast (no matter how many newlines the file contains).
It doesn't waste memory (no matter how many newlines the file contains).
FileReader fr=new FileReader(filename);
BufferedReader br=new BufferedReader(fr);
String strline;
String arr[]=new String[10];//10 is the no. of strings
while((strline=br.readLine())!=null)
{
arr[i++]=strline;
}
The simplest solution for reading a text file line by line and putting the results into an array of strings without using third party libraries would be this:
ArrayList<String> names = new ArrayList<String>();
Scanner scanner = new Scanner(new File("names.txt"));
while(scanner.hasNextLine()) {
names.add(scanner.nextLine());
}
scanner.close();
String[] namesArr = (String[]) names.toArray();
I always use this way:
String content = "";
String line;
BufferedReader reader = new BufferedReader(new FileReader(...));
while ((line = reader.readLine()) != null)
{
content += "\n" + line;
}
// Cut of the first newline;
content = content.substring(1);
// Close the reader
reader.close();
You can also use java.nio.file.Files to read an entire file into a String List then you can convert it to an array etc. Assuming a String variable named filePath, the following 2 lines will do that:
List<String> strList = Files.readAllLines(Paths.get(filePath), Charset.defaultCharset());
String[] strarray = strList.toArray(new String[0]);
A simpler (without loops), but less correct way, is to read everything to a byte array:
FileInputStream is = new FileInputStream(file);
byte[] b = new byte[(int) file.length()];
is.read(b, 0, (int) file.length());
String contents = new String(b);
Also note that this has serious performance issues.
If you have only InputStream, you can use InputStreamReader.
SmbFileInputStream in = new SmbFileInputStream("smb://host/dir/file.ext");
InputStreamReader r=new InputStreamReader(in);
char buf[] = new char[5000];
int count=r.read(buf);
String s=String.valueOf(buf, 0, count);
You can add cycle and StringBuffer if needed.
You can try Cactoos:
import org.cactoos.io.TextOf;
import java.io.File;
new TextOf(new File("a.txt")).asString().split("\n")
Fixed Version of #Anoyz's answer:
import java.io.FileInputStream;
import java.io.File;
public class App {
public static void main(String[] args) throws Exception {
File f = new File("file.txt");
long fileSize = f.length();
String file = "test.txt";
FileInputStream is = new FileInputStream("file.txt");
byte[] b = new byte[(int) f.length()];
is.read(b, 0, (int) f.length());
String contents = new String(b);
}
}

Categories