GZIPInputStream reading line by line - java

I have a file in .gz format. The Java class for reading this file is GZIPInputStream.
However, this class doesn't extend Java's BufferedReader class, so I am not able to read the file line by line. I need something like this:
reader = new MyGZInputStream( some constructor of GZInputStream)
reader.readLine()...
I thought of creating my own class that extends Java's Reader or BufferedReader class and uses a GZIPInputStream as one of its fields:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.Reader;
import java.util.zip.GZIPInputStream;
public class MyGZFilReader extends Reader {
private GZIPInputStream gzipInputStream = null;
char[] buf = new char[1024];
@Override
public void close() throws IOException {
gzipInputStream.close();
}
public MyGZFilReader(String filename)
throws FileNotFoundException, IOException {
gzipInputStream = new GZIPInputStream(new FileInputStream(filename));
}
@Override
public int read(char[] cbuf, int off, int len) throws IOException {
// TODO Auto-generated method stub
return gzipInputStream.read((byte[])buf, off, len);
}
}
But, this doesn't work when I use
BufferedReader in = new BufferedReader(
new MyGZFilReader("F:/gawiki-20090614-stub-meta-history.xml.gz"));
System.out.println(in.readLine());
Can someone advise how to proceed?

The basic setup of decorators is like this:
InputStream fileStream = new FileInputStream(filename);
InputStream gzipStream = new GZIPInputStream(fileStream);
Reader decoder = new InputStreamReader(gzipStream, encoding);
BufferedReader buffered = new BufferedReader(decoder);
The key issue in this snippet is the value of encoding. This is the character encoding of the text in the file. Is it "US-ASCII", "UTF-8", "SHIFT-JIS", "ISO-8859-9", …? There are hundreds of possibilities, and the correct choice usually cannot be determined from the file itself. It must be specified through some out-of-band channel.
For example, maybe it's the platform default. In a networked environment, however, this is extremely fragile. The machine that wrote the file might sit in the neighboring cubicle, but have a different default file encoding.
Most network protocols use a header or other metadata to explicitly note the character encoding.
In this case, it appears from the file extension that the content is XML. XML includes the "encoding" attribute in the XML declaration for this purpose. Furthermore, XML should really be processed with an XML parser, not as text. Reading XML line-by-line seems like a fragile, special case.
Failing to explicitly specify the encoding is against the second commandment. Use the default encoding at your peril!
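As a minimal sketch of the decorator chain above with the encoding passed in explicitly: the class name GzipLineReader and the method readLines are hypothetical names chosen for illustration, and the try-with-resources syntax assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the JDK.
final class GzipLineReader {

    /** Reads every line of a gzip-compressed text file, decoding it with the given charset. */
    static List<String> readLines(Path gzFile, Charset encoding) throws IOException {
        List<String> lines = new ArrayList<>();
        // Same chain as above: file bytes -> gunzip -> decode to chars -> buffer for readLine().
        try (InputStream fileStream = Files.newInputStream(gzFile);
             InputStream gzipStream = new GZIPInputStream(fileStream);
             BufferedReader reader = new BufferedReader(new InputStreamReader(gzipStream, encoding))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

A caller that knows the file is UTF-8 would pass StandardCharsets.UTF_8 as the second argument. For a very large file, processing each line inside the loop is likely preferable to collecting them all in a list.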

GZIPInputStream gzip = new GZIPInputStream(new FileInputStream("F:/gawiki-20090614-stub-meta-history.xml.gz"));
BufferedReader br = new BufferedReader(new InputStreamReader(gzip));
br.readLine();

BufferedReader in = new BufferedReader(new InputStreamReader(
new GZIPInputStream(new FileInputStream("F:/gawiki-20090614-stub-meta-history.xml.gz"))));
String content;
while ((content = in.readLine()) != null)
System.out.println(content);

You can use the following method in a util class, and use it whenever necessary...
public static List<String> readLinesFromGZ(String filePath) {
List<String> lines = new ArrayList<>();
File file = new File(filePath);
try (GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(file));
BufferedReader br = new BufferedReader(new InputStreamReader(gzip));) {
String line = null;
while ((line = br.readLine()) != null) {
lines.add(line);
}
} catch (FileNotFoundException e) {
e.printStackTrace(System.err);
} catch (IOException e) {
e.printStackTrace(System.err);
}
return lines;
}

Here it is as a single statement:
try (BufferedReader br = new BufferedReader(
new InputStreamReader(
new GZIPInputStream(
new FileInputStream(
"F:/gawiki-20090614-stub-meta-history.xml.gz")))))
{br.readLine();}

Related

Unable to read clear data from a PDF file when the language is not English

I am trying to copy some data from a PDF to a text file. Here is the code:
public void readPDFFile() throws IOException {
InputStreamReader reader;
OutputStreamWriter writer;
FileInputStream inputstream;
FileOutputStream outputStream;
BufferedReader bufferedReader = null;
BufferedWriter bufferedWriter = null;
String str;
File rfile = new File(
"C://Documents and Settings/Administrator/My Documents/EGDownloads/source.pdf");
File wFile = new File("C://Documents and Settings/Administrator/My Documents/Folder/destination.txt");
try {
inputstream = new FileInputStream(rfile);
outputStream = new FileOutputStream(wFile);
reader = new InputStreamReader(inputstream, "UTF-8");
writer = new OutputStreamWriter(outputStream, "UTF-8");
bufferedReader = new BufferedReader(reader);
bufferedWriter = new BufferedWriter(writer);
while ((str = bufferedReader.readLine()) != null) {
writer.write(str);
}
} catch (IOException es) {
System.out.println(es.getMessage());
es.printStackTrace(System.out);
} finally {
if (bufferedReader != null) {
bufferedReader.close();
}
if (bufferedWriter != null)
bufferedWriter.close();
}
}
The expected output is in another language, but all I am getting is random boxes. I have tried both UTF-16 and UTF-8.
I also tried PDFBox, but it still doesn't work: all I get is the original language's accents mixed with English text.
Notes:
1. I'm not trying to print the data to the console, but to copy it from a PDF to a text file.
2. The other file contains non-English words.
Can anyone help me solve this, or point me to a link that might help?
Thanks.
The PDF format is a binary format, so it cannot be read line by line as plain text. Yours would have to be a very unusual PDF for that to work, as all the ones I know of are compressed in some way. Use a proper library to read it, be it PDFBox, iText, or another. Be aware that for some PDFs it is impossible to extract the text. You can check this with Acrobat: if Acrobat can't do it, nobody can.

File encoding : saved content is different than when read

I have a slight problem trying to save a file in java.
For some reason the content I get after saving my file is different from what I have when I read it.
I guess this is related to file encoding, but without being sure.
Here is test code I put together. The idea is basically to read a file, and save it again.
When I open both files, they are different.
package workspaceFun;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.commons.codec.DecoderException;
public class FileSaveTest {
public static void main(String[] args) throws IOException, DecoderException{
String location = "test.location";
File locationFile = new File(location);
FileInputStream fis = new FileInputStream(locationFile);
InputStreamReader r = new InputStreamReader(fis, Charset.forName("UTF-8"));
System.out.println(r.getEncoding());
StringBuilder builder = new StringBuilder();
int ch;
while((ch = fis.read()) != -1){
builder.append((char)ch);
}
String fullLocationString = builder.toString();
//Now we want to save back
FileOutputStream fos = new FileOutputStream("C:/Users/me/Desktop/test");
byte[] b = fullLocationString.getBytes();
fos.write(b);
fos.close();
r.close();
}
}
An extract from the input file (opened as plain text using Sublime 2):
40b1 8b81 23bc 0014 1a25 96e7 a393 be1e
and from the output file :
40c2 b1c2 8bc2 8123 c2bc 0014 1a25 c296
The getEncoding method returns "UTF8". Trying to save the output file using the same charset does not seem to solve the issue.
What puzzles me is that when I try to read the input file using Hex from apache.commons.codec like this :
String hexLocationString2 = Hex.encodeHexString(fullLocationString.getBytes("UTF-8"));
The String already looks like my output file, not the input.
Would you have any idea on what can go wrong?
Thanks
Extra info for those being interested, I am trying to read an eclipse .location file.
EDIT: I placed the file online so that you can test the code
I believe the problem is the way you are reading the stream.
You are using the FileInputStream directly to read the content instead of reading through the InputStreamReader that wraps it.
By using the InputStreamReader you can determine which Charset to use.
Keep in mind that the Charset passed to the InputStreamReader must match the file's actual encoding: the reader doesn't detect charsets, it just decodes the bytes in the format you specify.
Try the following changes:
InputStreamReader r = new InputStreamReader(new FileInputStream(locationFile), StandardCharsets.UTF_8);
Then, instead of fis.read(), use r.read().
Finally, when writing the String, get the bytes in the same Charset as your Reader:
FileOutputStream fos = new FileOutputStream("C:/Users/me/Desktop/test");
fos.write(fullLocationString.getBytes(StandardCharsets.UTF_8));
fos.close();
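To illustrate why the output bytes in the question differ from the input, here is a small sketch that reproduces the effect in memory. The class name ByteRoundTrip and both method names are hypothetical, and the explanation assumes the platform default charset in the question's run was UTF-8 (or that UTF-8 was passed explicitly, as in the Hex example). It also shows ISO-8859-1 as one way to get a lossless byte-to-char round trip; for a file that is not text at all, copying the bytes without going through a String avoids the problem entirely.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the question.
final class ByteRoundTrip {

    /** Mimics the question's loop: each byte becomes one char, then the String is encoded as UTF-8. */
    static byte[] castThenEncodeUtf8(byte[] input) {
        StringBuilder sb = new StringBuilder();
        for (byte b : input) {
            // Same value FileInputStream.read() returns (0..255), cast to char.
            sb.append((char) (b & 0xFF));
        }
        // Chars above U+007F need two bytes in UTF-8, e.g. U+00B1 becomes C2 B1.
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** ISO-8859-1 maps every byte to exactly one char and back, so the bytes survive unchanged. */
    static byte[] latin1RoundTrip(byte[] input) {
        String s = new String(input, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

Feeding the first bytes of the question's input (40 b1 8b 81 23 bc) through castThenEncodeUtf8 yields 40 c2 b1 c2 8b c2 81 23 c2 bc, which matches the start of the output file shown in the question.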
Try to read and write back as below:
public class FileSaveTest {
public static void main(String[] args) throws IOException {
String location = "D:\\test.txt";
BufferedReader br = new BufferedReader(new FileReader(location));
StringBuilder sb = new StringBuilder();
try {
String line = br.readLine();
while (line != null) {
sb.append(line);
line = br.readLine();
if (line != null)
sb.append(System.lineSeparator());
}
} finally {
br.close();
}
FileOutputStream fos = new FileOutputStream("D:\\text_created.txt");
byte[] b = sb.toString().getBytes();
fos.write(b);
fos.close();
}
}
The test file contains both Cyrillic and Latin characters:
SDFASDF
XXFsd1
12312
іва

Read and Write Text in ANSI format

Please have a look at the following code
import java.io.*;
public class CSVConverter
{
private File csvFile;
private BufferedReader reader;
private StringBuffer strBuffer;
private BufferedWriter writer;
int startNumber = 0;
private String strString[];
public CSVConverter(String location, int startNumber)
{
csvFile = new File(location);
strBuffer = new StringBuffer("");
this.startNumber = startNumber;
//Read
try
{
reader = new BufferedReader(new FileReader(csvFile));
String line = "";
while((line=reader.readLine())!=null)
{
String[] array = line.split(",");
String inputQuery = "insertQuery["+startNumber+"] = \"insert into WordList_Table ('Engl','Port','EnglishH','PortugueseH','Numbe','NumberOf','NumberOfTime','NumberOfTimesPor')values('"+array[0]+"','"+array[2]+"','"+array[1]+"','"+array[3]+"',0,0,0,0)\"";
strBuffer.append(inputQuery+";"+"\r\n");
startNumber++;
}
}
catch(Exception e)
{
e.printStackTrace();
}
System.out.println(strBuffer.toString());
//Write
try
{
File file = new File("C:/Users/list.txt");
FileWriter filewrite = new FileWriter(file);
if(!file.exists())
{
file.createNewFile();
}
writer = new BufferedWriter(filewrite);
writer.write(strBuffer.toString());
writer.flush();
writer.close();
}
catch(Exception e)
{
e.printStackTrace();
}
}
public static void main(String[]args)
{
new CSVConverter("C:/Users/list.csv",90);
}
}
I am trying to read a CSV file, edit the text in code, and write it back to a .txt file. My issue is that the file contains Portuguese words, so it should be read and written using the ANSI format. Right now some Portuguese words are replaced with symbols in the output file.
How can I read and write text data into a file in ANSI format in Java?
To read a text file with a specific encoding, you can use a FileInputStream in conjunction with an InputStreamReader. The right Java encoding for Windows ANSI (on Western European locales) is Cp1252.
reader = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), "Cp1252"));
To write a text file with a specific character encoding, you can use a FileOutputStream together with an OutputStreamWriter.
writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "Cp1252"));
The classes InputStreamReader and OutputStreamWriter translate between byte oriented streams and text with a specific character encoding.
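Putting the two lines together, a minimal round-trip sketch might look like the following. The class AnsiTextFile and its methods are hypothetical names for illustration; "windows-1252" is the canonical name of the charset that Java also accepts under the alias Cp1252, and try-with-resources assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the question's CSVConverter.
final class AnsiTextFile {

    // Assumption: "ANSI" here means the Western European Windows code page.
    static final Charset WINDOWS_ANSI = Charset.forName("windows-1252");

    /** Writes each line followed by CRLF, encoding characters as Windows-1252 bytes. */
    static void writeLines(Path file, List<String> lines) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(file), WINDOWS_ANSI))) {
            for (String line : lines) {
                writer.write(line);
                writer.write("\r\n");
            }
        }
    }

    /** Reads all lines, decoding the file's bytes as Windows-1252. */
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), WINDOWS_ANSI))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Because Windows-1252 is a single-byte encoding, each Portuguese character such as ç or ã occupies exactly one byte in the written file.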

Java zip character encoding

I'm using the following method to compress a file into a zip file:
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
public static void doZip(final File inputfis, final File outputfis) throws IOException {
FileInputStream fis = null;
FileOutputStream fos = null;
final CRC32 crc = new CRC32();
crc.reset();
try {
fis = new FileInputStream(inputfis);
fos = new FileOutputStream(outputfis);
final ZipOutputStream zos = new ZipOutputStream(fos);
zos.setLevel(6);
final ZipEntry ze = new ZipEntry(inputfis.getName());
zos.putNextEntry(ze);
final int BUFSIZ = 8192;
final byte inbuf[] = new byte[BUFSIZ];
int n;
while ((n = fis.read(inbuf)) != -1) {
zos.write(inbuf, 0, n);
crc.update(inbuf);
}
ze.setCrc(crc.getValue());
zos.finish();
zos.close();
} catch (final IOException e) {
throw e;
} finally {
if (fis != null) {
fis.close();
}
if (fos != null) {
fos.close();
}
}
}
My problem is that I have flat text files with content such as N°TICKET, and after unzipping, the result contains weird characters: N° TICKET. Characters such as é and à are also not handled correctly.
I guess it's due to the character encoding, but I don't know how to set it in my zip method to ISO-8859-1 ?
(I'm running on windows 7, java 6)
You are using streams, which write exactly the bytes that they are given. Writers interpret character data and convert it to the corresponding bytes, and Readers do the opposite. Java (at least in version 6) doesn't provide an easy way to mix and match operations on zipped data with writing characters.
This way will work though. It is, however, a little clunky.
File inputFile = new File("utf-8-data.txt");
File outputFile = new File("latin-1-data.zip");
ZipEntry entry = new ZipEntry("latin-1-data.txt");
BufferedReader reader = new BufferedReader(new FileReader(inputFile));
ZipOutputStream zipStream = new ZipOutputStream(new FileOutputStream(outputFile));
BufferedWriter writer = new BufferedWriter(
new OutputStreamWriter(zipStream, Charset.forName("ISO-8859-1"))
);
zipStream.putNextEntry(entry);
// this is the important part:
// all character data is written via the writer and not the zip output stream
String line = null;
while ((line = reader.readLine()) != null) {
writer.append(line).append('\n');
}
writer.flush(); // i've used a buffered writer, so make sure to flush to the
// underlying zip output stream
zipStream.closeEntry();
zipStream.finish();
reader.close();
writer.close();
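The same idea can be sketched with both charsets stated explicitly and the streams closed automatically. This is only an illustration under assumptions: the class TranscodingZip and method zipAsLatin1 are hypothetical names, the input is assumed to be UTF-8 text, and try-with-resources requires Java 7 or later (the question mentions Java 6, where the manual close calls above would still be needed).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper illustrating the approach above.
final class TranscodingZip {

    /** Reads UTF-8 text and stores it in a single zip entry encoded as ISO-8859-1. */
    static void zipAsLatin1(Path utf8Text, Path zipFile, String entryName) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(utf8Text, StandardCharsets.UTF_8);
             ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            zip.putNextEntry(new ZipEntry(entryName));
            // All character data goes through the writer, which encodes it into the entry.
            Writer writer = new OutputStreamWriter(zip, StandardCharsets.ISO_8859_1);
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write('\n');
            }
            writer.flush(); // push the encoded bytes into the entry before closing it
            zip.closeEntry();
        }
    }
}
```

Note that characters with no ISO-8859-1 equivalent would be replaced by the encoder (typically with '?'), so this only suits text that fits in that charset.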
As far as I know, this is not available in Java 6.
But I do believe that http://commons.apache.org/compress/ can provide a solution.
Switching to Java 7 provides a new constructor that takes the encoding as an additional parameter.
https://blogs.oracle.com/xuemingshen/entry/non_utf_8_encoding_in
zipStream = new ZipInputStream(
new BufferedInputStream(new FileInputStream(archiveFile), BUFFER_SIZE),
Charset.forName("ISO-8859-1"));
Try using org.apache.commons.compress.archivers.zip.ZipFile instead of Java's own library, so you can specify the encoding like this:
import org.apache.commons.compress.archivers.zip.ZipFile;
ZipFile zipFile = new ZipFile(filepath,encoding);

Character corruption going from BufferedReader to BufferedWriter in java

In Java, I am trying to parse an HTML file that contains complex text such as greek symbols.
I encounter a known problem when text contains a left facing quotation mark. Text such as
mutations to particular “hotspot” regions
becomes
mutations to particular “hotspot�? regions
I have isolated the problem by writing a simple text copy method:
public static int CopyFile()
{
try
{
StringBuffer sb = null;
String NullSpace = System.getProperty("line.separator");
Writer output = new BufferedWriter(new FileWriter(outputFile));
String line;
BufferedReader input = new BufferedReader(new FileReader(myFile));
while((line = input.readLine())!=null)
{
sb = new StringBuffer();
//Parsing would happen
sb.append(line);
output.write(sb.toString()+NullSpace);
}
return 0;
}
catch (Exception e)
{
return 1;
}
}
Can anybody offer some advice as how to correct this problem?
★My solution
InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in,"utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
while ((r = reader.read()) != -1)
{
if (r<126)
{
output.write(r);
}
else
{
output.write("&#"+Integer.toString(r)+";");
}
}
output.flush();
The file read is not in the same encoding (probably UTF-8) as the file written (probably ISO-8859-1).
Try the following to generate a file with UTF-8 encoding:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));
Unfortunately, determining the encoding of a file is very difficult. See Java : How to determine the correct charset encoding of a stream
In addition to what Thierry-Dimitri Roy wrote, if you know the encoding you have to create your FileReader with a bit of extra work. From the docs:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
The Javadoc for FileReader says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:
FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);
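Combining this with the earlier advice about the writer, a copy routine that names the charset on both sides could be sketched as follows. The class CharsetCopy and method copyText are hypothetical names, the charsets are supplied by the caller because the actual encoding of the input file is not known from the question, and try-with-resources assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; stands in for the CopyFile method from the question.
final class CharsetCopy {

    /** Copies a text file line by line, decoding with inCharset and encoding with outCharset. */
    static void copyText(Path in, Charset inCharset, Path out, Charset outCharset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(in), inCharset));
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(Files.newOutputStream(out), outCharset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Any parsing of the line would happen here.
                writer.write(line);
                writer.newLine();
            }
        }
    }
}
```

Passing StandardCharsets.UTF_8 for both arguments should preserve characters such as curly quotation marks, provided the input file really is UTF-8.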
