try {
    // CompareRecord record = new CompareRecord();
    Connection conn = new CompareRecord().getConection("eliteddaprd", "eliteddaprd", "192.168.14.104", "1521");
    ResultSet res = null;
    if (conn != null) {
        Statement stmt = conn.createStatement();
        res = stmt.executeQuery("select rowindx,ADDRLINE1 from dedupinitial order by rowindx");
    }
    Map<Integer, String> adddressMap = new LinkedHashMap<Integer, String>();
    if (res != null) {
        System.out.println("result set is not null ");
        while (res.next()) {
            adddressMap.put(res.getInt(1), res.getString(2));
        }
    }
    System.out.println("address Map size =========> " + adddressMap.size());
    Iterator it = adddressMap.entrySet().iterator();
    int count = 0;
    int min = 0;
    while (it.hasNext()) {
        Map.Entry pairs = (Map.Entry) it.next();
        Pattern p = Pattern.compile("[,\\s]+");
        Integer outerkey = (Integer) pairs.getKey();
        String outerValue = (String) pairs.getValue();
        //System.out.println("outer Value ======> "+outerValue);
        String[] outerresult = p.split(outerValue);
        Map.Entry pairs2 = null;
        count++;
        List<Integer> dupList = new ArrayList<Integer>();
        Iterator innerit = adddressMap.entrySet().iterator();
        boolean first = true;
        while (innerit.hasNext()) {
            //System.out.println("count value ===> "+count);
            int totmatch = 0;
            if (first) {
                if (count == adddressMap.size()) {
                    break;
                }
                for (int i = 0; i <= count; i++) {
                    pairs2 = (Map.Entry) innerit.next();
                }
                first = false;
            } else {
                pairs2 = (Map.Entry) innerit.next();
            }
            Integer innterKey = (Integer) pairs2.getKey();
            String innerValue = (String) pairs2.getValue();
            //System.out.println("innrer value "+innerValue);
            String[] innerresult = p.split(innerValue);
            for (int j = 0; j < outerresult.length; j++) {
                for (int k = 0; k < innerresult.length; k++) {
                    if (outerresult[j].equalsIgnoreCase(innerresult[k])) {
                        //System.out.println(outerresult[j]+" Match With "+innerresult[k]);
                        totmatch++;
                        break;
                    }
                }
            }
            min = Math.min(outerresult.length, innerresult.length);
            if (min != 0 && ((totmatch * 100) / min) > 50) {
                //System.out.println("maching inner key =========> "+innterKey);
                dupList.add(innterKey);
            }
        }
        //System.out.println("Duplilcate List Sisze ===================> "+dupList.size()+" "+outerkey);
    }
    System.out.println("End =========> " + new Date());
} catch (Exception e) {
    e.printStackTrace();
}
Here the ResultSet processes around 500,000 records, but it gives me an error like:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:508)
at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:406)
at java.util.HashMap.put(HashMap.java:431)
at spite.CompareRecord.main(CompareRecord.java:91)
I know this error comes from the VM memory limit, but I don't know how to increase it in Eclipse.
What do I do if I have to process even more than 500,000 records?
In Run -> Run Configurations, find the name of the class you have been running, select it, click the Arguments tab, then add:
-Xms512M -Xmx1524M
to the VM Arguments section
In the Eclipse download folder, make these entries in the eclipse.ini file:
--launcher.XXMaxPermSize
512M
-vmargs
-Dosgi.requiredJavaVersion=1.5
-Xms512m
-Xmx1024m
or whatever values you want.
See http://blog.headius.com/2009/01/my-favorite-hotspot-jvm-flags.html
-Xms and -Xmx set the minimum and maximum sizes for the heap. Touted as a feature, Hotspot puts a cap on heap size to prevent it from blowing out your system. So once you figure out the max memory your app needs, you cap it to keep rogue code from impacting other apps. Use these flags like -Xmx512M, where the M stands for MB. If you don't include it, you're specifying bytes. Several flags use this format. You can also get a minor startup perf boost by setting minimum higher, since it doesn't have to grow the heap right away.
-XX:MaxPermSize=###M sets the maximum "permanent generation" size. Hotspot is unusual in that several types of data get stored in the "permanent generation", a separate area of the heap that is only rarely (or never) garbage-collected. The list of perm-gen hosted data is a little fuzzy, but it generally contains things like class metadata, bytecode, interned strings, and so on (and this certainly varies across Hotspot versions). Because this generation is rarely or never collected, you may need to increase its size (or turn on perm-gen sweeping with a couple other flags). In JRuby especially we generate a lot of adapter bytecode, which usually demands more perm gen space.
How to give your program more memory when running from Eclipse
Go to Run / Run Configurations. Select the run configuration for your program. Click on the tab "Arguments". In the "VM arguments" area, add an -Xmx argument, for example -Xmx2048m to give your program a maximum of 2048 MB (2 GB) of memory.
How to prevent this memory problem
Re-write your program in such a way that it does not need to store so much data in memory. I haven't looked at your code in detail, but it looks like you're storing a lot of data in a HashMap; more than fits in memory when you have a lot of records.
You may want to close open projects in your Project Explorer. You can do this by right-clicking an open project, and selecting "Close Project".
This is what worked for me after I noticed that I had to periodically increase the heap size in my_directory\eclipse\eclipse.ini.
Hope this helps! Peace!
To increase the Heap size in eclipse change the eclipse.ini file.
refer to
http://wiki.eclipse.org/FAQ_How_do_I_increase_the_heap_size_available_to_Eclipse%3F
You can increase the size of the memory through the use of commandline arguments.
See this link.
eclipse -vmargs -Xmx1024m
Edit: Also see this excellent question
In eclipse.ini file, make below entries
-Xms512m
-Xmx2048m
-XX:MaxPermSize=1024m
Please make sure your code is fine. I got stuck on this problem once and tried the solution accepted here, but in vain, so I rewrote my code. It turned out I was using a custom ArrayList and adding values from an array; changing it to hold only the primitive values worked for me.
What do I do if I have to process even more than 500,000 records?
There are a few ways to increase the Java heap size for your app, as others have already suggested. Your app needs to remove elements from adddressMap as it adds new ones, so you won't hit an OOM when more records come in. Look up the producer-consumer pattern if you are interested.
I am not a Java pro, but your problem can be solved with a BlockingQueue if you use it wisely; a rough sketch follows.
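For illustration only, here is a minimal, hedged sketch of that producer-consumer idea, assuming a bounded ArrayBlockingQueue and reusing the rowindx/ADDRLINE1 query from the question; the connection URL, queue capacity and end-of-data marker are illustrative choices, not part of the original answer:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StreamingCompare {

    private static final String END_OF_DATA = "\u0000"; // marker telling the consumer to stop (illustrative)

    public static void main(String[] args) throws Exception {
        // Placeholder connection; substitute your own getConection(...) call here.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@192.168.14.104:1521:eliteddaprd", "eliteddaprd", "eliteddaprd");

        // Bounded queue: when it is full, the reader blocks instead of filling the heap.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        Thread producer = new Thread(() -> {
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "select rowindx, ADDRLINE1 from dedupinitial order by rowindx")) {
                while (rs.next()) {
                    queue.put(rs.getInt(1) + "|" + rs.getString(2)); // hand each row off instead of caching it
                }
                queue.put(END_OF_DATA);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // Consumer: handle one address at a time, so memory use stays roughly constant.
        for (String row = queue.take(); !END_OF_DATA.equals(row); row = queue.take()) {
            // comparison / dedupe logic for a single row goes here
        }
        conn.close();
    }
}

The key point is the bounded capacity: the database reader can never get more than one queue's worth of rows ahead of the comparison logic.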
Try retrieving a chunk of records first, process them, and repeat until all records are processed. This may help you get rid of the OutOfMemoryError; see the sketch below.
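A rough sketch of that chunked approach against the question's dedupinitial table might look like the following; the chunk size, the rowindx-range predicate and the connection URL are assumptions for illustration, and it presumes rowindx values are reasonably dense:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class ChunkedCompare {

    public static void main(String[] args) throws Exception {
        // Placeholder connection; reuse your own connection set-up here.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@192.168.14.104:1521:eliteddaprd", "eliteddaprd", "eliteddaprd");
        final int chunkSize = 10_000;

        try (PreparedStatement ps = conn.prepareStatement(
                "select rowindx, ADDRLINE1 from dedupinitial where rowindx between ? and ? order by rowindx")) {
            for (int start = 1; ; start += chunkSize) {
                ps.setInt(1, start);
                ps.setInt(2, start + chunkSize - 1);

                Map<Integer, String> chunk = new LinkedHashMap<>();
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        chunk.put(rs.getInt(1), rs.getString(2));
                    }
                }
                if (chunk.isEmpty()) {
                    break; // assumes no gap in rowindx larger than one chunk
                }
                // run the comparison logic on this chunk, then let it become garbage-collectable
            }
        }
        conn.close();
    }
}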
In Eclipse, go to Run -> Run Configurations, find the name of the class you have been running, select it, click the Target tab, then under "Additional Emulator Command Line Options" add:
-Xms512M -Xmx1524M
then click apply.
Go to "Window -> Preferences -> General -> C/C++ -> Code analysis"
and disable "Syntax and Semantics Errors -> Abstract class cannot be instantiated"
Sometimes the exception will not go away even after you increase the memory in the eclipse.ini file. In that case, try the option below:
Go to Window >> Preferences >> MyEclipse >> Java Enterprise Project >> Web Project >> Deployment Tab
Under -> Under Library Deployment Policies
UnCheck -> Jars from User Libraries
The accepted solution of modifying a Run Configuration wasn't ideal for me as I have a few different run configurations and could easily forget to do this when adding further ones in future. Also I wanted the setting to apply whenever running anything, e.g. when running JUnit tests by right-clicking and selecting "Run As" -> "JUnit Test".
The above can be achieved by modifying the JRE/JDK settings instead:
Go to Window -> Preferences.
Select Java -> Installed JREs
Select your default JRE (which could be a JDK) and click on "Edit..."
Change the "Default VM arguments" value as required, e.g. -Xms512m -Xmx4G -XX:MaxPermSize=256M
Enter the MAT folder.
Click on the MemoryAnalyzer.ini file and change it from
-vmargs
-Xmx1024m
to
-vmargs
-Xmx4096m
and restart the MAT application. It will resolve your problem.
Related
I need another set of eyes on this.
I've written a zip file out to hundreds of gigabytes with this exact code, with no modifications, locally on Mac OS X.
With 100% unchanged code, just deployed to an AWS instance running Ubuntu, this same code runs into Out of Memory issues (heap space).
Here's the code that's being run, streaming MyBatis to a CSV file on disk:
File directory = new File(feedDirectory);
File file;
try {
    file = File.createTempFile(("feed-" + providerCode + "-"), ".csv", directory);
} catch (IOException e) {
    throw new RuntimeException("Unable to create file to write feed to disk: " + e.getMessage(), e);
}
String filePath = file.getAbsolutePath();
log.info(String.format("File name for %s feed is %s", providerCode, filePath));
// output file
try (FileOutputStream out = new FileOutputStream(file)) {
    streamData(out, providerCode, startDate, endDate);
} catch (IOException e) {
    throw new RuntimeException("Unable to write feed to file: " + e.getMessage());
}
public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVPrinter printer = CsvUtil.openPrinter(outputStream)) {
        StreamingHandler<FStay> handler = stayPrintingHandler(printer);
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}

private StreamingHandler<FStay> stayPrintingHandler(CSVPrinter printer) {
    StreamingHandler<FStay> handler = new StreamingHandler<>();
    handler.setHandler((stay) -> {
        try {
            EXPORTER.writeStay(printer, stay);
        } catch (IOException e) {
            log.error("Issue with writing output: " + e.getMessage(), e);
        }
    });
    return handler;
}
// The EXPORTER method
import org.apache.commons.csv.CSVPrinter;
public void writeStay(CSVPrinter printer, FStay stay) throws IOException {
    List<Object> list = asList(stay);
    printer.printRecord(list);
}

List<Object> asList(FStay stay) {
    List<Object> list = new ArrayList<>(46);
    list.add(stay.getUid());
    list.add(stay.getProviderCode());
    //....
    return list;
}
Here's a graph of the JVM heap space (using jvisualvm) when I run this locally. I've run this consistently with Java 8 (jdk1.8.0_51 and 1.8.0_112) locally and have gotten great results. I've even written out a terabyte of data.
In the graph above, the max heap space is set to 4 GB, and the most it ever increases to is 1.5 GB before going back down to around 500 MB, while streaming data to the CSV file as it's supposed to.
However, when I run this on Ubuntu with jdk 1.8.0_111, the exact same operation will not complete, running out of heap space (java.lang.OutOfMemoryError: Java heap space)
I've upped the Xmx value from 8 gigs to 16 to 25 gigs, and still run out of heap space. Meanwhile... the total size of the file is only 10 Gigs in total... which really perplexes me.
Here's what the JVisualVm graph looks like on the Ubuntu box:
I've no doubt it's the exact same code running in both environments, with the same operation being performed in each (same database server providing the same data)
The only differences I can think of at this point are:
Operating system - Ubuntu vs Mac OS X
Hosted VM in AWS vs hard metal laptop
Network speed is faster in AWS between database and Ubuntu server
JDK version is 1.8.0_111 in Ubuntu, tried 1.8.0_51 and 1.8.0_112 locally
Can anyone help shed any light on this problem?
Update
I've tried replacing all the 'try-with-resources' statements with explicit flush/close statements and no luck.
What's more, I tried to force a garbage collection on the Ubuntu box as soon as I started to see the data come in, and it had no effect-- there is something definitely stopping the heap from being collected on the Ubuntu machine... while running the exact same code on OS X let me write the full enchilada again no problem.
Update 2
In addition to the differences in the environments above, the only other difference I can think of is if the connection between the servers in AWS is so fast that it streams the data faster than it can flush the data to disk... but that still doesn't explain the issue where I only have 10 gigs of data total, and it blows up a JVM with 20 Gigs of heap space.
Is there any likelihood of there being a bug at the Ubuntu/Java level for this?
Update 3
Tried replacing the output of the CSVPrinter to use an entirely separate library (OpenCSV's CSVWriter in lieu of Apache's CSV library) and the same result occurs.
As soon as this code starts receiving data from the database, the heap starts blowing up and the garbage collector fails to reclaim any memory... but only on Ubuntu. On OS X, everything is reclaimed immediately and the heap never grows.
I've also tried flushing the stream after every write, but had no luck with that as well.
Update 4
Got the heap dump to print out, and according to this I should be looking at the database driver. Specifically the InboundDataHandler in amazon's redshift driver.
I'm using myBatis with a custom result handler. I tried setting the result handler to effectively do nothing when it gets a result (new ResultHandler<>() { // method overridden to do literally nothing}) and I know I'm not holding on to any references there.
Since it's the InboundDataHandler defined by AWS/Redshift... it makes me think it may be lower than the myBatis level... either:
Error in the SqlSessionFactory I'm setting up
Bug in the Redshift driver that only pops up in Ubuntu / AWS
Bug in the result handler I have overwritten
Here's the heap dump screenshot:
Here's where I'm setting up my SqlSessionFactoryBean:
@Bean
public javax.sql.DataSource redshiftDataSource() throws ClassNotFoundException {
    log.info("Got to datasource config");
    // Dynamically load driver at runtime.
    Class.forName(dataWarehouseDriver);
    DataSource dataSource = new DataSource();
    dataSource.setURL(dataWarehouseUrl);
    dataSource.setUserID(dataWarehouseUsername);
    dataSource.setPassword(dataWarehousePassword);
    return dataSource;
}

@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());
    return factoryBean;
}
Here's the myBatis code I'm running as a test to verify that it's not me holding on to records in my ResultHandler:
warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, new ResultHandler<FStay>() {
    @Override
    public void handleResult(ResultContext<? extends FStay> resultContext) {
        // do nothing
    }
});
Is there a way I can force the SQL connection to not hang on to records or something? I'll again re-iterate that on my local machine, there is no issue with this memory leak... it only surfaces when running the code in the hosted AWS environment. And in both cases, the Database driver and server are the same.
Update 6
I think it's finally fixed. Thanks to all who pointed me in the direction of the heap dump. That helped narrow it down to the offending class in a huge way.
After that, I did some research on the AWS redshift driver, and it explicitly says that your clients should specify a limit for any operations on large data. So I found out how to do that in my myBatis configuration:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
select distinct
f_stay.uid,
And this did the trick.
Mind you, this isn't necessary even when handling much larger data sets downloaded remotely from AWS (Database in AWS, code executing on laptop at home), and this shouldn't be necessary since I'm overriding the myBatis ResultHandler<> which handles each row individually and never holds on to any objects.
Yet something funky happens with the AWS redshift jdbc driver only when it's run in AWS (database in aws, code executing in AWS instance) which causes this InboundDataHandler to never release its resources, unless a fetchSize is specified.
Here's the heap of the server running now, getting much further than it ever has before in AWS, with the heap space never moving above 500 MB; after I hit "force GC" in jvisualvm, it shows the "used" heap at less than 100 MB:
Thanks again in a huge way to all those who helped guide this!
Finally figured out a solution.
The heap dump was the biggest aid: it indicated that the InboundDataHandler class of Amazon's Redshift/Postgres JDBC driver was the prime culprit.
The code to set up the SqlSession appeared legit, so traveling over to Amazon's documentation landed this gem:
To avoid client-side out-of-memory errors when retrieving large data
sets using JDBC, you can enable your client to fetch data in batches
by setting the JDBC fetch size parameter.
We hadn't run into this before, as we stream results with custom ResultHandlers in MyBatis... but there seems to be something different when the AWS Redshift JDBC driver is running on AWS itself vs outside AWS connecting in.
Taking the guidance from the documentation, we added a 'fetchSize' to our MyBatis select query:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
select distinct
f_stay.uid,
And voila! Everything worked swimmingly. This is the only change we made and the heap never went above a couple hundred MBs.
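For readers not using MyBatis, the same idea at the plain-JDBC level is Statement.setFetchSize. The sketch below is a hedged illustration only: the table and column names are placeholders rather than the original mapper SQL, and whether the hint is honoured (for example, PostgreSQL-style drivers typically require autocommit to be off) depends on the driver.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class FetchSizeSketch {

    // Streams rows in driver-side batches instead of materialising the whole result set.
    static void streamStays(Connection connection) throws SQLException {
        connection.setAutoCommit(false); // many PostgreSQL-style drivers only honour fetch size with autocommit off
        try (PreparedStatement ps = connection.prepareStatement(
                "select uid, provider_code from f_stay")) { // placeholder query, not the original mapper SQL
            ps.setFetchSize(1000); // ask the driver to pull rows roughly 1000 at a time
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // handle one row at a time; nothing is accumulated in memory
                }
            }
        }
    }
}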
You can see in one of the graphs above where the heap goes off the charts: as soon as data starts being received on Amazon, the heap marches right up linearly and never reclaims an ounce of heap space once it starts.
My guess is the Redshift JDBC driver is doing something different when it's in Amazon's environment for some kind of optimization... that's all I can think of to explain the behavior.
Clearly Amazon knows what's going on since they documented it up front. I may not know the full 'why' of what's happening, but at least everything is resolved in what appears to be a satisfactory way.
Thanks to all those who helped.
The xlsx package can be used to read and write Excel spreadsheets from R. Unfortunately, even for moderately large spreadsheets, java.lang.OutOfMemoryError can occur. In particular,
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
java.lang.OutOfMemoryError: GC overhead limit exceeded
(Other related exceptions are also possible but rarer.)
A similar question was asked regarding this error when reading spreadsheets.
Importing a big xlsx file into R?
The main advantage of using Excel spreadsheets as a data storage medium over CSV is that you can store multiple sheets in the same file, so here we consider a list of data frames to be written one data frame per worksheet. This example dataset contains 40 data frames, each with two columns of up to 200k rows. It is designed to be big enough to be problematic, but you can change the size by altering n_sheets and n_rows.
library(xlsx)
set.seed(19790801)
n_sheets <- 40
the_data <- replicate(
  n_sheets,
  {
    n_rows <- sample(2e5, 1)
    data.frame(
      x = runif(n_rows),
      y = sample(letters, n_rows, replace = TRUE)
    )
  },
  simplify = FALSE
)
names(the_data) <- paste("Sheet", seq_len(n_sheets))
The natural method of writing this to file is to create a workbook using createWorkbook, then loop over each data frame calling createSheet and addDataFrame. Finally the workbook can be written to file using saveWorkbook. I've added messages to the loop to make it easier to see where it falls over.
wb <- createWorkbook()
for(i in seq_along(the_data))
{
  message("Creating sheet", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame", i)
  addDataFrame(the_data[[i]], sheet)
}
saveWorkbook(wb, "test.xlsx")
Running this in 64-bit on a machine with 8GB RAM, it throws the GC overhead limit exceeded error while running addDataFrame for the first time.
How do I write large datasets to Excel spreadsheets using xlsx?
This is a known issue:
http://code.google.com/p/rexcel/issues/detail?id=33
While unresolved, the issue page links to a solution by Gabor Grothendieck suggesting that the heap size should be increased by setting the java.parameters option before the rJava package is loaded. (rJava is a dependency of xlsx.)
options(java.parameters = "-Xmx1000m")
The value 1000 is the number of megabytes of RAM to allow for the Java heap; it can be replaced with any value you like. My experiments with this suggest that bigger values are better, and you can happily use your full RAM entitlement. For example, I got the best results using:
options(java.parameters = "-Xmx8000m")
on the machine with 8GB RAM.
A further improvement can be obtained by requesting a garbage collection in each iteration of the loop. As noted by @gjabel, R garbage collection can be performed using gc(). We can define a Java garbage collection function that calls the Java System.gc() method:
jgc <- function()
{
  .jcall("java/lang/System", method = "gc")
}
Then the loop can be updated to:
for(i in seq_along(the_data))
{
  gc()
  jgc()
  message("Creating sheet", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame", i)
  addDataFrame(the_data[[i]], sheet)
}
With both these code fixes, the code ran as far as i = 29 before throwing an error.
One technique that I tried unsuccessfully was to use write.xlsx2 to write the contents to file at each iteration. This was slower than the other code, and it fell over on the 10th iteration (but at least part of the contents were written to file).
for(i in seq_along(the_data))
{
  message("Writing sheet", i)
  write.xlsx2(
    the_data[[i]],
    "test.xlsx",
    sheetName = names(the_data)[i],
    append = i > 1
  )
}
Building on @richie-cotton's answer, I found that adding gc() to the jgc function kept the CPU usage low.
jgc <- function()
{
  gc()
  .jcall("java/lang/System", method = "gc")
}
My previous for loop still struggled with the original jgc function, but with this extra command I no longer run into the "GC overhead limit exceeded" error message.
Solution for the above error:
Please use the R code below:
detach(package:xlsx)
detach(package:XLConnect)
library(openxlsx)
Then try to import the file again; you should not get the error. This worked for me.
Restart R and, before loading the R packages, insert:
options(java.parameters = "-Xmx2048m")
or
options(java.parameters = "-Xmx8000m")
You can also call gc() inside the loop if you are writing row by row. gc() triggers garbage collection and can be used for any kind of memory issue.
I was having issues with write.xlsx() rather than reading, but then realised that I had accidentally been running 32-bit R. Swapping it out for 64-bit fixed the issue.
I'm wondering if anyone knows of a way to import data from a "big" xlsx file (~20Mb). I tried to use xlsx and XLConnect libraries. Unfortunately, both use rJava and I always obtain the same error:
> library(XLConnect)
> wb <- loadWorkbook("MyBigFile.xlsx")
Error: OutOfMemoryError (Java): Java heap space
or
> library(xlsx)
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space
I also tried to modify the java.parameters before loading rJava:
> options( java.parameters = "-Xmx2500m")
> library(xlsx) # load rJava
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space
or after loading rJava (this is a bit stupid, I think):
> library(xlsx) # load rJava
> options( java.parameters = "-Xmx2500m")
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space
But nothing works. Does anyone have an idea?
I stumbled on this question when someone sent me (yet another) Excel file to analyze. This one isn't even that big but for whatever reason I was running into a similar error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Based on a comment by @DirkEddelbuettel on a previous answer, I installed the openxlsx package (http://cran.r-project.org/web/packages/openxlsx/) and then ran:
library("openxlsx")
mydf <- read.xlsx("BigExcelFile.xlsx", sheet = 1, startRow = 2, colNames = TRUE)
It was just what I was looking for. Easy to use and wicked fast. It's my new BFF. Thanks for the tip @DirkEddelbuettel!
options(java.parameters = "-Xmx2048m") ## memory set to 2 GB
library(XLConnect)
Allow for more memory using options() before any Java component is loaded, then load the XLConnect library (it uses Java).
That's it. Start reading in data with readWorksheet .... and so on.
:)
I agree with @orville jackson's response; it really helped me too.
In line with the answer provided by @orville jackson, here is a detailed description of how you can use openxlsx for reading and writing big files.
When the data size is small, R has many packages and functions that can be used as per your requirement.
write.xlsx, write.xlsx2, and XLConnect also do the job, but they are sometimes slow compared to openxlsx.
So, if you are dealing with large data sets and run into Java errors, I would suggest having a look at openxlsx, which is really awesome and cuts the time to about a twelfth.
I've tested them all and was really impressed with the performance of openxlsx.
Here are the steps for writing multiple datasets into multiple sheets.
install.packages("openxlsx")
library("openxlsx")
start.time <- Sys.time()
# Creating large data frame
x <- as.data.frame(matrix(1:4000000,200000,20))
y <- as.data.frame(matrix(1:4000000,200000,20))
z <- as.data.frame(matrix(1:4000000,200000,20))
# Creating a workbook
wb <- createWorkbook("Example.xlsx")
Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") ## path to zip.exe
Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") has to be static as it takes reference of some utility from Rtools.
Note: In case Rtools is not installed on your system, please install it first for a smooth experience. Here is the link for your reference (choose the appropriate version):
https://cran.r-project.org/bin/windows/Rtools/
Check the options as per the link below (you need to select all the checkboxes during installation):
https://cloud.githubusercontent.com/assets/7400673/12230758/99fb2202-b8a6-11e5-82e6-836159440831.png
# Adding worksheets: parameters for addWorksheet are 1. workbook name 2. sheet name
addWorksheet(wb, "Sheet 1")
addWorksheet(wb, "Sheet 2")
addWorksheet(wb, "Sheet 3")
# Writing data into respective sheets: parameters for writeData are 1. workbook name 2. sheet index/sheet name 3. data frame name
writeData(wb, 1, x)
# In case you would like to write a sheet with a filter available for ease of access, you can pass the parameter withFilter = TRUE to the writeData function.
writeData(wb, 2, x = y, withFilter = TRUE)
## Similarly, writeDataTable is another way to represent your data with table formatting:
writeDataTable(wb, 3, z)
saveWorkbook(wb, file = "Example.xlsx", overwrite = TRUE)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
The openxlsx package is really good for reading and writing huge amounts of data from/to Excel files and has lots of options for custom formatting within Excel.
The interesting fact is that we don't have to worry about Java heap memory here.
I know this question is a bit old, but there is a good solution for it nowadays: readxl. It is the default package when you import Excel files through the RStudio GUI, and it works well in my situation.
library(readxl)
data <- read_excel(filename)
As mentioned in the canonical Excel->R question, a recent alternative which has emerged comes from the readxl package, which I've found to be quite fast, compared with, e.g. openxlsx and xlsx.
That said, there's a definite limit of spreadsheet size past which you're probably better off just saving the thing as a .csv and using fread.
I also had the same error in both xlsx::read.xlsx and XLConnect::readWorksheetFromFile. Maybe you can use RODBC::odbcDriverConnect and RODBC::sqlFetch, which go through Microsoft's ODBC interface and are much more efficient.
@flodel's suggestion of converting to CSV seems the most straightforward. If, for whatever reason, that's not an option, you can read in the file in chunks:
require(XLConnect)
chnksz <- 2e3
s <- <sheet>
wb <- loadWorkbook(<file>, s)
tot.rows <- getLastRow(wb)
last.row <- 0
for (i in seq(ceiling(tot.rows / chnksz))) {
  next.batch <- readWorksheet(wb, s, startRow = last.row + 1, endRow = last.row + chnksz)
  last.row <- last.row + chnksz  # advance the window by one chunk each pass
  # optionally save next.batch to disk or
  # assign it to a list. See which works for you.
}
I found this thread looking for an answer to the exact same question. Rather than trying to hack at an xlsx file from within R, what ended up working for me was to convert the file to .csv using Python and then import it into R using a standard scanning function.
Check out: https://github.com/dilshod/xlsx2csv
I want to write a program (preferably in java) that will parse and analyze a java heap dump file (created by jmap). I know there are many great tools that already do so (jhat, eclipse's MAT, and so on), but I want to analyze the heap from a specific perspective to my application.
Where can I read about the structure of the heap dump file, find examples of how to read it, and so on? I didn't find anything useful searching for it...
Many thanks.
The following was done with Eclipse Version: Luna Service Release 1 (4.4.1) and Eclipse Memory Analyzer Version 1.4.0
Programmatically Interfacing with the Java Heap Dump
Environment Setup
In eclipse, Help -> Install New Software -> install Eclipse Plug-in Development Environment
In eclipse, Window -> Preferences -> Plug-in Development -> Target Platform -> Add
Nothing -> Locations -> Add -> Installation
Name = MAT
Location = /path/to/MAT/installation/mat
Project Setup
File -> new -> Other -> Plug-in Project
Name: MAT Extension
Next
Disable Activator
Disable Contributions to the UI
Disable API analysis
Next
Disable template
Finish
Code Generation
Open plugin.xml
Dependencies -> Add
select org.eclipse.mat.api
Extensions -> Add
select org.eclipse.mat.report.query
right click on report.query -> New
Name: MyQuery
click "impl" to generate the class
Implementing IQuery
#CommandName("MyQuery") //for the command line interface
#Name("My Query") //display name for the GUI
#Category("Custom Queries") //list this Query will be put under in the GUI
#Help("This is my first query.") //help displayed
public class MyQuery implements IQuery
{
public MyQuery{}
#Argument //snapshot will be populated before the call to execute happens
public ISnapshot snapshot;
/*
* execute : only method overridden from IQuery
* Prints out "My first query." to the output file.
*/
#Override
public IResult execute(IProgressListener arg0) throws Exception
{
CharArrayWriter outWriter = new CharArrayWriter(100);
PrintWriter out = new PrintWriter(outWriter);
SnapshotInfo snapshotInfo = snapshot.getSnapshotInfo();
out.println("Used Heap Size: " + snapshotInfo.getUsedHeapSize());
out.println("My first query.")
return new TextResult(outWriter.toString(), false);
}
}
ctrl+shift+o will generate the correct "import" statements.
Custom queries can be accessed within the MAT GUI by accessing the toolbar's "Open Query Browser" at the top of the hprof file you have opened.
The ISnapshot interface
The most important interface one can use to extract data from a heap dump is ISnapshot. ISnapshot represents a heap dump and offers various methods for reading object and classes from it, getting the size of objects, etc…
To obtain an instance of ISnapshot one can use static methods on the SnapshotFactory class. However, this is only needed if the API is used to implement a tool independent of Memory Analyzer. If you are writing extensions to MAT, then your coding will get an instance corresponding to an already opened heap dump either by injection or as a method parameter.
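For the standalone case, a hedged sketch of obtaining an ISnapshot through SnapshotFactory could look like the following; the dump path is a placeholder, the use of VoidProgressListener is an assumption rather than part of the original answer, and the getUsedHeapSize call mirrors the MyQuery example above.

import java.io.File;

import org.eclipse.mat.snapshot.ISnapshot;
import org.eclipse.mat.snapshot.SnapshotFactory;
import org.eclipse.mat.util.VoidProgressListener;

public class StandaloneHeapDumpReader {

    public static void main(String[] args) throws Exception {
        // Path to the .hprof file is a placeholder; pass your own dump on the command line.
        File dump = new File(args[0]);
        ISnapshot snapshot = SnapshotFactory.openSnapshot(dump, new VoidProgressListener());
        try {
            // Same SnapshotInfo call used in the MyQuery example above.
            System.out.println("Used heap size: " + snapshot.getSnapshotInfo().getUsedHeapSize());
        } finally {
            SnapshotFactory.dispose(snapshot); // release the index files MAT keeps open
        }
    }
}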
Reference
Built in Command Line Utility
If you're looking to have a program generate the usual reports, there is a command line utility called ParseHeapDump available with any download of Eclipse's MAT tool. You'll be able to get useful html dumps of all the information MAT stores.
> ParseHeapDump <heap dump> org.eclipse.mat.api:suspects org.eclipse.mat.api:top_components org.eclipse.mat.api:overview #will dump out all general reports available through MAT
Hopefully this is enough information to get you started.
I'm not familiar with jhat, but Eclipse's MAT is open source. Their SVN link is available, perhaps you could look through that for their parser, perhaps even use it.
The following code throws an OutOfMemoryError on a Linux 3.5 enterprise box running JDK 1.6.0_14, but runs fine on JDK 1.6.0_20. I am clueless as to why this is happening.
while (rs.next()) {
    for (TableMetaData tabMeta : metaList) {
        rec.append(getFormattedString(rs, tabMeta));
    }
    rec.append(lf);
    recCount++;
    if (recCount % maxRecBeforWrite == 0) {
        bOutStream.write(rec.toString().getBytes());
        rec = null;
        rec = new StringBuilder();
    }
}
bOutStream.write(rec.toString().getBytes());
The getFormattedString() method goes here:
private String getFormattedString(ResultSet rs, TableMetaData tabMeta)
        throws SQLException, IOException {
    String colValue = null;
    // check if it is a CLOB column
    if (tabMeta.isCLOB()) {
        // Column is a CLOB, so fetch it and retrieve first clobLimit chars.
        colValue = String.format("%-" + clobLimit + "s", getCLOBString(rs, tabMeta));
    } else {
        colValue = String.format("%-" + tabMeta.getColumnSize() + "s",
                rs.getString(tabMeta.getColumnName()));
    }
    return colValue;
}
Below is the exception trace:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Formatter$FormatSpecifier.justify(Formatter.java:2827)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2821)
at java.util.Formatter$FormatSpecifier.printString(Formatter.java:2794)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2677)
at java.util.Formatter.format(Formatter.java:2433)
at java.util.Formatter.format(Formatter.java:2367)
at java.lang.String.format(String.java:2769)
at com.boa.cpal.cpal2repnet.main.CPALToReportNet.getFormattedString(Unknown Source)
I suspect that the use of String.format is the culprit, but I'm not sure. How do I overcome this issue?
Please note that this code was written to query a database with huge tables, read the result set, and create extract files with specific formatting.
The exception you are getting refers to the GC overhead limit that is enabled by this HotSpot option:
-XX:+UseGCOverheadLimit: Use a policy that limits the proportion of the VM's time that is spent in GC before an OutOfMemory error is thrown. (Introduced in 6.)
So, my best guess is that your application is simply running out of heap space. As @Andreas_D's answer says, the default heap sizes were changed between jdk1.6.0_14 and JDK 1.6.0_20, and that could explain the different behaviour. Your options are:
Upgrade to the later JVM. (UPDATE - 2012/06 even JDK 1.6.0_20 is now very out of date. Later 1.6 and 1.7 releases have numerous security fixes.)
Explicitly set the heap dimensions -Xmx and -Xms options when launching the JVM. If you are already doing this (on the older JVM), increase the numbers so that the maximum heap size is larger.
You could also adjust the GC overhead limit, but that's probably a bad idea on a production server.
If this particular problem only happens after your server has been running for some time, then maybe you've got memory leaks.
Garbage collection has been improved significantly with JDK 1.6.0_18:
In the Client JVM, the default Java heap configuration has been modified to improve the performance of today's rich client applications. Initial and maximum heap sizes are larger and settings related to generational garbage collection are better tuned.
A quick look at the details in these release notes makes me believe that this is why you have fewer problems with 1.6.0_20.
The following part of the code is not consistent with the comment:
// Column is a CLOB, so fetch it and retrieve first clobLimit chars.
colValue = String.format("%-" + clobLimit + "s", getCLOBString(rs, tabMeta));
colValue doesn't get the first clobLimit characters from the CLOB; it is left-justified in a field of width clobLimit. I wasn't sure, so I tried
System.out.println(String.format("%-5s", "1234567890"));
and the output was
1234567890
To achieve what the comment describes, you can use the simpler form:
colValue = getCLOBString(rs, tabMeta).substring(0, clobLimit);
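If the CLOB value can also be shorter than clobLimit, substring alone will throw a StringIndexOutOfBoundsException, so a hedged variant that both truncates long values and pads short ones might look like this (fixedWidth is a hypothetical helper, not part of the original code):

final class FormatUtil {

    // Truncates values longer than width and left-justifies shorter ones, padding with spaces.
    static String fixedWidth(String value, int width) {
        String truncated = value.length() > width ? value.substring(0, width) : value;
        return String.format("%-" + width + "s", truncated);
    }
}

The CLOB branch then becomes colValue = FormatUtil.fixedWidth(getCLOBString(rs, tabMeta), clobLimit);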