java: reading large file with charset

My file is 14 GB and I would like to read it line by line and export it to an Excel file.
As the file contains different languages, such as Chinese and English,
I tried to use FileInputStream with UTF-16 for reading the data,
but that results in java.lang.OutOfMemoryError: Java heap space.
I have tried to increase the heap space but the problem still exists.
How should I change my file reading code?
createExcel(); // open an Excel file
try {
    // works, but cannot read and output the different languages correctly
    // br = new BufferedReader(
    //         new FileReader("C:\\Users\\brian_000\\Desktop\\appdatafile.json"));

    // results in java.lang.OutOfMemoryError: Java heap space
    br = new BufferedReader(new InputStreamReader(
            new FileInputStream("C:\\Users\\brian_000\\Desktop\\appdatafile.json"),
            "UTF-16"));
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
System.out.println("cann be print");
String line;
int i=0;
try {
while ((line = br.readLine()) != null) {
// process the line.
try{
System.out.println("cannot be print");
//some statement for storing the data in variables.
//a function for writing the variable into excel
writeToExcel(platform,kind,title,shareUrl,contentRating,userRatingCount,averageUserRating
,marketLanguage,pricing
,majorVersionNumber,releaseDate,downloadsCount);
}
catch(com.google.gson.JsonSyntaxException exception){
System.out.println("error");
}
// trying to get the first 1000rows
i++;
if(i==1000){
br.close();
break;
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
closeExcel();
public static void writeToExcel(String platform, String kind, String title, String shareUrl,
        String contentRating, String userRatingCount, String averageUserRating,
        String marketLanguage, String pricing, String majorVersionNumber,
        String releaseDate, String downloadsCount) {
    currentRow++;
    System.out.println(currentRow);
    if (currentRow > 1000000) {
        // current sheet is full, start a new one
        currentsheet++;
        sheet = workbook.createSheet("apps" + currentsheet, 0);
        createFristRow();
        currentRow = 1;
    }
    try {
        // id column
        Label label = new Label(0, currentRow, String.valueOf(currentRow), cellFormat);
        sheet.addCell(label);
        // 12 statements like this write the data fields to the sheet
        label = new Label(1, currentRow, platform, cellFormat);
        sheet.addCell(label);
    } catch (WriteException e) {
        e.printStackTrace();
    }
}

Excel, UTF-16
As mentioned, the problem is most likely caused by the Excel document construction, which keeps the whole workbook in memory before it is written. Also try whether UTF-8 yields a smaller size; for instance, Chinese HTML is still more compact in UTF-8 than in UTF-16 because of the many ASCII characters.
Object creation in Java
You can share common small Strings instead of creating a new object per cell. This is useful for String.valueOf(row) and similar values; cache only strings with a small length. I assume the cellFormat is a single fixed instance.
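A minimal sketch of such a cache (the length threshold, map type and method name are arbitrary illustrative choices, not from the original answer):

private static final java.util.Map<String, String> SMALL_STRINGS = new java.util.HashMap<String, String>();

static String shared(String s) {
    if (s == null || s.length() > 8) {
        return s; // only cache short strings to keep the map small
    }
    String cached = SMALL_STRINGS.get(s);
    if (cached == null) {
        SMALL_STRINGS.put(s, s);
        cached = s;
    }
    return cached;
}

You would then wrap repeated cell values, e.g. new Label(1, currentRow, shared(platform), cellFormat).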
DIY with xlsx
The Excel library builds a costly in-memory DOM.
If CSV text (with a Unicode BOM marker) is not an option (you could give it the extension .xls so that Excel opens it), try generating an xlsx yourself.
Create an example workbook in xlsx format.
This is a zip format that you can process in Java most easily with a zip filesystem.
Inside it, Excel keeps a sheet-content XML and a shared-strings XML; cell values are stored as indices from the content XML into the shared strings.
Then no overflow happens, as you write the data buffer-wise.
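A rough sketch of that approach, assuming a pre-built template.xlsx whose sheet entry is rewritten by streaming rows; the entry path, the use of inline strings instead of shared strings, and the missing XML escaping are simplifications for illustration:

static void streamRows(java.nio.file.Path xlsx, java.util.List<String[]> rows) throws java.io.IOException {
    java.util.Map<String, String> env = java.util.Collections.singletonMap("create", "false");
    // open the copied template workbook as a zip filesystem
    try (java.nio.file.FileSystem zip = java.nio.file.FileSystems.newFileSystem(
            java.net.URI.create("jar:" + xlsx.toUri()), env)) {
        java.nio.file.Path sheet = zip.getPath("/xl/worksheets/sheet1.xml");
        try (java.io.BufferedWriter w = java.nio.file.Files.newBufferedWriter(
                sheet, java.nio.charset.StandardCharsets.UTF_8)) {
            w.write("<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\"><sheetData>");
            for (String[] row : rows) {
                w.write("<row>");
                for (String value : row) {
                    // inline strings keep the sketch simple; real values must be XML-escaped
                    w.write("<c t=\"inlineStr\"><is><t>" + value + "</t></is></c>");
                }
                w.write("</row>");
            }
            w.write("</sheetData></worksheet>");
        }
    }
}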
Or use a JDBC driver for Excel. (No recent experience on my side, maybe JDBC/ODBC.)
Best
Excel is hard to use with that much data. Consider putting more effort into using a database, or write every N rows into a separate, proper Excel file. Maybe you can later merge them with Java into one document. (I doubt it.)
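For the database route, a minimal sketch of batched JDBC inserts; the SQLite URL, table and column names are placeholders, and platform, title, shareUrl stand for values parsed from the current line:

try (java.sql.Connection con = java.sql.DriverManager.getConnection("jdbc:sqlite:apps.db");
     java.sql.PreparedStatement ps = con.prepareStatement(
             "INSERT INTO apps(platform, title, share_url) VALUES (?, ?, ?)")) {
    con.setAutoCommit(false);
    int n = 0;
    String line;
    while ((line = br.readLine()) != null) {
        // parse the line, then bind the parsed values
        ps.setString(1, platform);
        ps.setString(2, title);
        ps.setString(3, shareUrl);
        ps.addBatch();
        if (++n % 10000 == 0) {
            ps.executeBatch(); // keeps memory usage flat regardless of file size
            con.commit();
        }
    }
    ps.executeBatch();
    con.commit();
}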

Related

TimeSeries Forecasting Weka - Java API

I am trying to implement TimeSeries Forecasting in a Java service in webMethods. My code is not working and I am completely lost, so I would be glad if you could help me! FYI: I used this tutorial.
This is the exception I get: com.wm.lang.flow.FlowException: weka.core.expressionlanguage.parser.Parser.getSymbolFactory()Ljava_cup/runtime/SymbolFactory;
I will just post the part which is not webMethods-specific (normal Java):
In the first part I am building an ARFF file, which works fine; I saved the file, opened it with the Weka Explorer, and everything looks fine.
The ARFF file looks like this:
@relation Rel
@attribute Count numeric
@data
2758
2797
2861
575
505
4029
(just with some more values (59 in total))
I want to forecast the next 3 values.
Forecasting Part:
// At the beginning I create and save the ARFF file, so I have an
// Instances object called 'dataset'
WekaForecaster forecaster = new WekaForecaster();
try {
    forecaster.setFieldsToForecast("Count");
} catch (Exception e) {
    e.printStackTrace();
}
forecaster.setBaseForecaster(new GaussianProcesses());

forecaster.getTSLagMaker().setTimeStampField("Date");
forecaster.getTSLagMaker().setMinLag(1);
forecaster.getTSLagMaker().setMaxLag(12);
forecaster.getTSLagMaker().setAddMonthOfYear(true);
forecaster.getTSLagMaker().setAddQuarterOfYear(true);

PrintStream stream = null;
List<List<NumericPrediction>> forecast = null;
try {
    stream = new PrintStream("./path/forecast.txt");
    forecaster.buildForecaster(dataset, stream);
    forecaster.primeForecaster(dataset);
    forecast = forecaster.forecast(3, dataset, stream);
} catch (Exception e) {
    e.printStackTrace();
}

// output the predictions
for (int i = 0; i < 3; i++) {
    List<NumericPrediction> predsAtStep = forecast.get(i);
    NumericPrediction predForTarget = predsAtStep.get(0);
    stream.print("" + predForTarget.predicted() + " ");
    stream.println();
}
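(For reference, a minimal sketch of how the 'dataset' Instances object mentioned above might be loaded from the saved ARFF file, using Weka's ConverterUtils.DataSource; this is illustrative, not code from the original post, and both calls may throw, so wrap them in try/catch as needed.)

weka.core.converters.ConverterUtils.DataSource source =
        new weka.core.converters.ConverterUtils.DataSource("./path/data.arff");
weka.core.Instances dataset = source.getDataSet();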
The Java code is hard to debug in webMethods, but it seems that forecaster.buildForecaster(dataset, stream); is what causes the exception.
What am I missing?

ObjectInputStream - reading large binary file - problems with memory

Before I proceed to my question, please note that I am not working on any client-server application that would require serialization; the program I am trying to customize stores one big instance of one big class in a .dat file. I have read about this issue (memory leak in ObjectOutputStream and ObjectInputStream) and about the options I would probably need to consider:
use the ObjectOutputStream.reset() method after writing the class instance to the .dat file, so that the stream doesn't hold the reference anymore;
rewrite the code without using serialization;
split the file and read it in chunks;
change the JVM memory parameter by using -Xmx.
So, I was provided with a class that generates a language model and saves it with a .dat extension; the code was probably optimized for small model files (the two model files provided as examples are both around 10 MB), but the model I generated is much larger, around 40 MB. Then there is another class, in another folder and totally independent of the first one, that uses this model, and the model has to be loaded using ObjectInputStream. Here comes the problem: a classic "OutOfMemoryError: Java heap space".
Writing the object:
try {
    // Create an output stream to the file.
    FileOutputStream file_output = new FileOutputStream(file);
    ObjectOutputStream o = new ObjectOutputStream(file_output);
    o.writeObject(this);
    file_output.close();
}
catch (IOException e) {
    System.err.println("IO exception = " + e);
}
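(For reference, the reset() option from the list above would look roughly like the following sketch, with exception handling omitted; it mainly helps when many objects are written to the same stream, which is not the case here.)

ObjectOutputStream o = new ObjectOutputStream(new FileOutputStream(file));
o.writeObject(this);
o.reset(); // drop the stream's back-references so already-written objects can be garbage collected
o.close();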
Reading the object:
InputStream model = null;
ModelGeneration oRead = null;
ObjectInputStream p = null;
try {
    model = new FileInputStream(filename);
    BufferedInputStream buf = new BufferedInputStream(model);
    p = new ObjectInputStream(buf);
    oRead = (ModelGeneration) p.readObject();
    p.reset();
} catch (IOException e) {
    e.printStackTrace();
} catch (ClassNotFoundException e) {
    e.printStackTrace();
} finally {
    try {
        model.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I tried to use the reset() method, but it is useless because we load only one instance of one class at a time, nothing else. This is also why I can't split the file: only one class instance is stored in the .dat file.
Changing the heap space seems like a worse solution than optimizing the code.
I would really appreciate your advice on what I can do.
Btw the code is here : http://svn.apache.org/repos/asf/uima/addons/trunk/Tagger/, I only implemented the required classes for a different language.
P.S. Works fine if I create a smaller model, but I would prefer the bigger one.

Encoding and decoding random byte array with zxing

I'm trying to transfer a byte array via a QR code, so for testing I decided to generate a random byte array, encode it as a QR code, and then decode it. I used ISO-8859-1 to convert the byte array to a string so that it does not lose data in transmission:
On the encoder side:
byte[] buffer = new byte[11];
com.google.zxing.Writer writer = new QRCodeWriter();
Random randomGenerator = new Random();
for (int i = 0; i <= 10; i++) {
    buffer[i] = (byte) randomGenerator.nextInt(254);
}
// Log.i("time1", "original: " + Arrays.toString(buffer));
String decoded = null;
try {
    decoded = new String(buffer, "ISO-8859-1");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
try {
    result = writer.encode(decoded, BarcodeFormat.QR_CODE, 500, 500);
} catch (WriterException e1) {
    e1.printStackTrace();
}
In this way I have converted the byte array to a QR code; so far no problem.
But on the receiver side:
LuminanceSource source = new PlanarYUVLuminanceSource(data, 640, 480, 0, 0, 640, 480, false);
bmtobedecoded = new BinaryBitmap(new HybridBinarizer(source));
Map<DecodeHintType, Object> mp = new HashMap<DecodeHintType, Object>();
mp.put(DecodeHintType.TRY_HARDER, true);
try {
    result = qrr.decode(bmtobedecoded, mp);
} catch (NotFoundException e) {
    Log.i("123", "not found");
    e.printStackTrace();
} catch (ChecksumException e) {
    Log.i("123", "checksum");
    e.printStackTrace();
} catch (FormatException e) {
    Log.i("123", "format");
    e.printStackTrace();
}
I tried to decode the generated QR code, but it throws a NotFoundException.
Can someone help me with this issue?
Update 1: I confirmed that the decoder works perfectly with a normal QR code. I also added DecodeHintType.TRY_HARDER, but still no luck.
Update 2: To clarify, below is what I did to convert between the byte array and the string:
Random randomGenerator = new Random();
for (int i = 0; i <= 10; i++) {
    buffer[i] = (byte) randomGenerator.nextInt(254);
}
Log.i("time1", "original: " + Arrays.toString(buffer));
String decoded = null;
try {
    decoded = new String(buffer, "ISO-8859-1");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Log.i("time1", "encoded string:" + decoded);
BitMatrix result = null;
try {
    result = qw.encode(decoded, BarcodeFormat.QR_CODE, 500, 500);
} catch (WriterException e1) {
    e1.printStackTrace();
}
iv.setImageBitmap(encodematrix(result));
byte[] encoded = null;
try {
    encoded = decoded.getBytes("ISO-8859-1");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Log.i("time1", "result byte array:" + java.util.Arrays.toString(encoded));
If you run this, you can easily see that you get exactly the same array back at the end. I have no problem with this part.
Update 3: I also tried encoding it using UTF-8, but it loses data, so it cannot be used in the encoder.
Update 4: I just added:
Map<DecodeHintType,Object> mp=new HashMap<DecodeHintType, Object>();
mp.put(DecodeHintType.CHARACTER_SET, "ISO-8859-1");
in the decoder, but it still throws the exception.
Try PURE_BARCODE mode as a decode hint. Strangely, false-positive detection of finder patterns is a much bigger problem when the image is a pure synthetic image; the detection heuristics assume a photo, which doesn't have these problems. In this alternate mode the decoder can take advantage of knowing it's a pure image and not a photo, be much faster, and never get the detection wrong.
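A minimal sketch of adding that hint to the hint map from the question (assuming the bitmap really is a clean, axis-aligned rendering of the QR code; exception handling as in the original decode call):

Map<DecodeHintType, Object> hints = new HashMap<DecodeHintType, Object>();
hints.put(DecodeHintType.TRY_HARDER, Boolean.TRUE);
hints.put(DecodeHintType.PURE_BARCODE, Boolean.TRUE); // the image is a synthetic barcode, not a photo
result = qrr.decode(bmtobedecoded, hints);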
There are two issues that you have to overcome to store binary data in QR codes.
ISO-8859-1 does not define printable characters for bytes in the ranges 00-1F and 7F-9F (they are control codes). Since you are using a random generator, you can simply check whether a generated byte falls into these ranges and regenerate it until you get one that does not. If you nevertheless need to encode such bytes, you can encode the array as a Base64 string or as a hexadecimal string. In the case of a hexadecimal string, it will be stored in the QR code in alphanumeric mode, not in 8-bit mode.
Since you are trying to store binary data in QR codes, you have to rely only on your own scanner to handle this binary data, and you need to make sure that your scanner does not use heuristics to automatically determine the character encoding. Most QR decoders use heuristics to detect the character set used; these heuristics may detect a character set other than ISO-8859-1 and thus fail to display your binary data properly. Some scanners use such heuristics even when the character set is explicitly given by the optional ECI extension inside the QR code.
So, using US-ASCII only (e.g., binary data encoded in Base64 before passing it to the QR code generator) is the safest choice against these heuristics. It also avoids another complication: ISO-8859-1 was not the default encoding in the earlier QR code standard published in 2000 (ISO/IEC 18004:2000). That standard specified the 8-bit Latin/Kana character set of JIS X 0201 (JIS8) as the default encoding for 8-bit mode, while the updated standard published in 2005 changed the default to ISO-8859-1.
If you store your buffer as a hexadecimal string in your QR code, this disables all heuristics for sure and should not produce a noticeably larger QR code than Base64, because each character in alphanumeric mode takes only about 5.5 bits (11 bits per character pair) in the QR code stream.
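A minimal sketch of the Base64 route on Android, wrapped around the encode/decode calls from the question (writer, qrr, bmtobedecoded and mp are the objects from the question; exception handling omitted):

// encoder side: turn the raw bytes into a pure-ASCII payload
String payload = android.util.Base64.encodeToString(buffer, android.util.Base64.NO_WRAP);
BitMatrix matrix = writer.encode(payload, BarcodeFormat.QR_CODE, 500, 500);

// decoder side: read the QR text back, then recover the original bytes
String text = qrr.decode(bmtobedecoded, mp).getText();
byte[] original = android.util.Base64.decode(text, android.util.Base64.NO_WRAP);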

using CSVWriter to export database tables with BLOB

I've already tried exporting my database tables to CSV using CSVWriter,
but my tables contain BLOB data. How can I include it in my export?
Later on, I'm going to import that exported CSV using CSVReader. Can anyone share some concepts?
This is part of my code for the export:
ResultSet res = st.executeQuery("select * from " + db + "." + obTableNames[23]);
int columnCount = getColumnCount(res);
try {
    File filename = new File(dir, "" + obTableNames[23] + ".csv");
    fw = new FileWriter(filename);
    CSVWriter writer = new CSVWriter(fw);
    writer.writeAll(res, false);
    int colType = res.getMetaData().getColumnType(columnCount);
    dispInt(colType);
    fw.flush();
    fw.close();
} catch (IOException e) {
    e.printStackTrace();
}
Did you take a look at the encodeBase64String(byte[] data) method from the Base64 class provided by Apache Commons Codec?
Encodes binary data using the base64 algorithm but does not chunk the output.
This should allow you to turn your Binary Large Object into an encoded string and incorporate it in your CSV.
People on the other side can then use decodeBase64(String data) to get the BLOB back again.
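A rough sketch of how that could look with opencsv and Commons Codec, writing row by row so the BLOB column can be transformed (the query, column names and indexes are placeholders for illustration; exception handling omitted):

ResultSet res = st.executeQuery("select id, name, image from " + db + "." + obTableNames[23]);
CSVWriter writer = new CSVWriter(new FileWriter(new File(dir, obTableNames[23] + ".csv")));
while (res.next()) {
    String[] row = new String[3];
    row[0] = res.getString("id");
    row[1] = res.getString("name");
    // Base64-encode the BLOB so it survives as plain CSV text
    row[2] = org.apache.commons.codec.binary.Base64.encodeBase64String(res.getBytes("image"));
    writer.writeNext(row);
}
writer.close();
// importing later: byte[] blob = org.apache.commons.codec.binary.Base64.decodeBase64(row[2]);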

how to intentionally corrupt a file in java

Note: please do not judge this question. To those who think that I am doing this to "cheat": you are mistaken, as I am no longer in school anyway. In addition, if I actually were trying to cheat, I would simply use services that have already been created for this instead of recreating the program. I took on this project because I thought it might be fun, nothing else. Before you down-vote, please consider the value of the question itself and not its speculative uses, as the purpose of SO is not to judge but simply to give the public information.
I am developing a program in Java that is supposed to intentionally corrupt a file (specifically a .doc, .txt, or .pdf, but other formats would be good as well).
I initially tried this:
public void corruptFile(String pathInName, String pathOutName) {
    curroptMethod method = new curroptMethod();
    ArrayList<Integer> corruptHash = corrupt(getBytes(pathInName));
    writeBytes(corruptHash, pathOutName);
    new MimetypesFileTypeMap().getContentType(new File(pathInName));
    // "/home/ephraim/Desktop/testfile"
}

public ArrayList<Integer> getBytes(String filePath) {
    ArrayList<Integer> fileBytes = new ArrayList<Integer>();
    try {
        FileInputStream myInputStream = new FileInputStream(new File(filePath));
        do {
            int currentByte = myInputStream.read();
            if (currentByte == -1) {
                System.out.println("broke loop");
                break;
            }
            fileBytes.add(currentByte);
        } while (true);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(fileBytes);
    return fileBytes;
}

public void writeBytes(ArrayList<Integer> hash, String pathName) {
    try {
        OutputStream myOutputStream = new FileOutputStream(new File(pathName));
        for (int currentHash : hash) {
            myOutputStream.write(currentHash);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // System.out.println(hash);
}

public ArrayList<Integer> corrupt(ArrayList<Integer> hash) {
    ArrayList<Integer> corruptHash = new ArrayList<Integer>();
    ArrayList<Integer> keywordCodeArray = new ArrayList<Integer>();
    Integer keywordIndex = 0;
    String keyword = "corruptthisfile";
    for (int i = 0; i < keyword.length(); i++) {
        keywordCodeArray.add(keyword.codePointAt(i));
    }
    for (Integer currentByte : hash) {
        // Integer currentByteProduct = (keywordCodeArray.get(keywordIndex) + currentByte) / 2;
        Integer currentByteProduct = currentByte - keywordCodeArray.get(keywordIndex);
        if (currentByteProduct < 0) currentByteProduct += 255;
        corruptHash.add(currentByteProduct);
        if (keywordIndex == (keyword.length() - 1)) {
            keywordIndex = 0;
        } else {
            keywordIndex++;
        }
    }
    // System.out.println(corruptHash);
    return corruptHash;
}
But the problem is that the file is still openable. When you open it, all of the words are changed (they may not make any sense, and they may not even be letters), but it can still be opened.
So here is my actual question:
Is there a way to make a file so corrupt that the computer doesn't know how to open it at all (i.e. when you open it, the computer will say something along the lines of "this file is not recognized, and cannot be opened")?
I think you want to look into RandomAccessFile. Also, it is almost always the case that a program recognizes its file by its very start, so open the file and scramble the first 5 bytes.
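A minimal sketch of that idea, using the test file path from the question (exception handling omitted; this irreversibly damages the file's header):

RandomAccessFile raf = new RandomAccessFile("/home/ephraim/Desktop/testfile", "rw");
byte[] garbage = new byte[5];
new Random().nextBytes(garbage);   // random replacement bytes
raf.seek(0);                       // the format's signature lives at the very start of the file
raf.write(garbage);                // overwrite the first 5 bytes
raf.close();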
The only way to fully corrupt an arbitrary file is to replace all of its contents with random garbage. Even then, there is a vanishingly small probability that the random garbage will actually be something meaningful.
Depending on the file type, it may be possible to recover from limited - or even from not so limited - corruption. E.g.:
Streaming media codecs are designed with network packet loss taken into account. Limited corruption may show up as picture artifacts, or even as a few lost frames, but the content is usually still viewable.
Block-based compression algorithms, such as bzip2, allow undamaged blocks to be recovered.
File-based compression systems such as rar and zip may be able to recover those files whose compressed data has not been damaged, regardless of damage to the rest of the archive.
Human-readable text, such as text files and source code files, is still viewable in a text editor even if parts of it are corrupt, not to mention that its size does not change. Unless you corrupted the whole thing, any casual reader would be able to tell whether an assignment was done and whether the retransmitted file was the same as the one that got corrupted.
Apart from the ethical issue, have you considered that this would be a one-time thing only? Data corruption does happen, but it's not that frequent and it's never that convenient...
If you are that desperate for more time, you would be better off breaking your leg and getting yourself admitted to a hospital.
There are better ways:
Your professor accepts Word documents. Infect it with a macro virus before sending.
"Forget" to attach the file to the email.
Forge the send date on your email. If your prof is the kind that accepts Word docs, this may work.
