I keep getting a heap memory problem while processing a huge file. Here I'm processing a 9 GB XML file.
This is my code:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
InputStream in = new FileInputStream(sourcePath);
XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
Map<String, Cmt> mapCmt = new ConcurrentHashMap<String, Cmt>();
while (eventReader.hasNext()) {
    XMLEvent event = eventReader.nextEvent();
    if (event.isStartElement()) {
        // some processing and assigning values to the map
        Cmt cmt = new Cmt();
        // get attributes
        cmt.setDetails(attribute.getValue());
        mapCmt.put(someKey, cmt);
    }
}
I get the heap memory problem in the iteration after some time.
Please help me write optimized code.
Note: the server has 3 GB of heap space available; I can't increase it.
I'm executing with the following parameters: -Xms1024m -Xmx3g
My XML looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DatosAbonados xmlns="http://www.cnmc.es/DatosAbonados">
<DatosAbonado Operacion="1" FechaExtraccion="2015-10-08">
<Titular>
<PersonaJuridica DocIdentificacionJuridica="A84619488" RazonSocial="HERMANOS ROJAS" NombreComercial="PINTURAS ROJAS"/>
</Titular>
<Domicilio Escalera=" " Piso=" " Puerta=" " TipoVia="AVENIDA" NombreVia="MANOTERAS" NumeroCalle="10" Portal=" " CodigoPostal="28050" Poblacion="Madrid" Provincia="28"/>
<NumeracionAbonado>
<Rangos NumeroDesde="211188600" NumeroHasta="211188699" ConsentimientoGuias-Consulta="1" VentaDirecta-Publicidad="1" ModoPago="1">
<Operador RazonSocial="11888 SERVICIO CONSULTA TELEFONICA S.A." DocIdentificacionJuridica="A83519389"/>
</Rangos>
</NumeracionAbonado>
</DatosAbonado>
<DatosAbonado Operacion="1" FechaExtraccion="2015-10-08">
<Titular>
<PersonaJuridica DocIdentificacionJuridica="A84619489" RazonSocial="HERMANOS RUBIO" NombreComercial="RUBIO PELUQUERIAS"/>
</Titular>
<Domicilio Escalera=" " Piso=" " Puerta=" " TipoVia="AVENIDA" NombreVia="BURGOS" NumeroCalle="18" Portal=" " CodigoPostal="28036" Poblacion="Madrid" Provincia="28"/>
<NumeracionAbonado>
<Rangos NumeroDesde="211186000" NumeroHasta="211186099" ConsentimientoGuias-Consulta="1" VentaDirecta-Publicidad="1" ModoPago="1">
<Operador RazonSocial="11888 SERVICIO CONSULTA TELEFONICA S.A." DocIdentificacionJuridica="A83519389"/>
</Rangos>
</NumeracionAbonado>
</DatosAbonado>
</DatosAbonados>
My Cmt class is:
public class Cmt {

    private List<DetailInfo> details;

    public List<DetailInfo> getDetails() {
        return details;
    }

    public void setDetails(DetailInfo detail) {
        if (details == null) {
            details = new ArrayList<DetailInfo>();
        }
        this.details.add(detail);
    }
}
Actually there are very few Cmt objects, but a DetailInfo object is created for every element, so a huge number of DetailInfo objects are created.
My logic is this:
if (startElement.getName().getLocalPart().equals("DatosAbonado")) {
    detailInfo = new DetailInfo();
    Iterator<Attribute> attributes = startElement.getAttributes();
    while (attributes.hasNext()) {
        Attribute attribute = attributes.next();
        if (attribute.getName().toString().equals("Operacion")) {
            detailInfo.setOperacion(attribute.getValue());
        }
    }
}
if (event.isEndElement()) {
    EndElement endElement = event.asEndElement();
    if (endElement.getName().getLocalPart().equals("DatosAbonado")) {
        Cmt cmt = null;
        if (mapCmt.keySet().contains(identificador)) {
            cmt = mapCmt.get(identificador);
        } else {
            cmt = new Cmt();
        }
        cmt.setDetails(detailInfo);
        mapCmt.put(identificador, cmt);
    }
}
The root of your problems is most likely this:
mapCmt.put(someKey, cmt);
You are populating a hashmap with a number of large Cmt objects. You need to do one of the following:
Process the data immediately rather than saving it in a data structure (a minimal sketch of this option follows below).
Write the data out to a database for later querying.
Increase the heap size.
Figure out a less "memory hungry" representation for your data.
The last two approaches don't scale though. As you increase the size of the input file, you will need progressively more memory ... until you eventually exceed the memory capacity of your execution platform.
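For the first option, here is a minimal sketch (my own illustration, not the asker's actual processing) of handling each DatosAbonado as soon as it is parsed so that nothing accumulates in a map; handleRecord is a placeholder for whatever per-record work is really needed:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Iterator;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class StreamingDatosAbonados {

    public static void main(String[] args) throws Exception {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        try (InputStream in = new FileInputStream(args[0])) {
            XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    if (start.getName().getLocalPart().equals("DatosAbonado")) {
                        Iterator<Attribute> attributes = start.getAttributes();
                        while (attributes.hasNext()) {
                            Attribute attribute = attributes.next();
                            if (attribute.getName().toString().equals("Operacion")) {
                                // Process the record right here (write it out, aggregate it,
                                // insert it into a database, ...) instead of keeping it in a Map.
                                handleRecord(attribute.getValue());
                            }
                        }
                    }
                }
            }
        }
    }

    // Placeholder for whatever per-record processing is actually needed.
    private static void handleRecord(String operacion) {
        // e.g. append a line to an output file or update an aggregate
    }
}

Because each record is handled and then dropped, memory usage stays roughly constant regardless of the file size.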
DatosAbonado is indeed the killer. If you have plenty of them, they will cause your application to choke.
The approach is simply not scalable. As pointed out by Stephen C, you need to process each DatosAbonado as it arrives rather than collecting them all in a container.
Since this is a typical scenario for which I developed LDX+ code generator, I went to the steps of:
creating an XML Schema file from XML (because you had not provided it) using: https://devutilsonline.com/xsd-xml/generate-xsd-from-xml
generating code with LDX+
This code generator is actually using SAX, and the resulting code allows you to:
serialize the complexElements to Java objects
configure how to treat 1-to-many relationships (like the one you have here) at runtime
I uploaded the code here: https://bitbucket.org/lolkedijkstra/ldx-samples
To see the code, navigate to the Source folder. There you'll find DatosAbonados.
This approach scales really well (memory consumption stays flat).
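This is not the LDX+ generated code, but as a rough hand-written illustration of the SAX streaming style it relies on, a handler that processes each DatosAbonado as the parser pushes it might look like this:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DatosAbonadoSaxDemo {

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // SAX pushes each element as it is read; nothing is held in memory
                if (qName.equals("DatosAbonado")) {
                    String operacion = attributes.getValue("Operacion");
                    // handle the record here (write it out, aggregate it, ...)
                    System.out.println("Operacion=" + operacion);
                }
            }
        });
    }
}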
DropwizardMetricServices#submit(), which I'm using, doesn't submit the gauge metric a second time.
My use-case is to remove the gauge metric from JMX after reading it; my application can then send the same metric again (with a different value).
The first time, the gauge metric is submitted successfully (and my application removes it once it has read the metric). But the same metric is not submitted the second time.
So I'm a bit confused about what the reason could be for DropwizardMetricServices#submit() not working the second time.
Below is the code:
Submit metric:
private void submitNonSparseMetric(final String metricName, final long value) {
validateMetricName(metricName);
metricService.submit(metricName, value); // metricService is the DropwizardMetricServices
log(metricName, value);
LOGGER.debug("Submitted the metric {} to JMX", metricName);
}
Code that reads and removes the metric:
protected void collectMetrics() {
// Create the connection
Long currTime = System.currentTimeMillis()/1000; // Graphite needs
Socket connection = createConnection();
if (connection == null){
return;
}
// Get the output stream
DataOutputStream outputStream = getDataOutputStream(connection);
if (outputStream == null){
closeConnection();
return;
}
// Get metrics from JMX
Map<String, Gauge> g = metricRegistry.getGauges(); // metricRegistry is com.codahale.metrics.MetricRegistry
for(Entry<String, Gauge> e : g.entrySet()){
String key = e.getKey();
if(p2cMetric(key)){
String metricName = convertToMetricStandard(key);
String metricValue = String.valueOf(e.getValue().getValue());
String metricToSend = String.format("%s %s %s\n", metricName, metricValue, currTime);
try {
writeToStream(outputStream, metricToSend);
// Remove the metric from JMX after successfully sending metric to graphite
removeMetricFromJMX(key);
} catch (IOException e1) {
LOGGER.error("Unable to send metric to Graphite - {}", e1.getMessage());
}
}
}
closeOutputStream();
closeConnection();
}
I think I found the issue.
Per the DropwizardMetricServices doc - https://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/actuate/metrics/dropwizard/DropwizardMetricServices.html#submit-java.lang.String-double- , the submit() method "sets the specified gauge value".
So I think DropwizardMetricServices#submit() is only meant to set the value of a gauge metric, not to (re-)add a metric to JMX.
Once I replaced DropwizardMetricServices#submit() with MetricRegistry#register() (com.codahale.metrics.MetricRegistry) to submit my metrics, it worked as expected and my metrics are re-added to JMX (after they were removed by my application).
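For reference, a minimal sketch (my own illustration, not the poster's code) of (re-)registering a gauge directly with com.codahale.metrics.MetricRegistry; the metric name and value here are placeholders:

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class GaugeRegistration {

    private final MetricRegistry metricRegistry = new MetricRegistry();

    // Registers (or re-registers) a gauge reporting the given value. Any stale
    // entry is removed first so register() does not throw
    // "A metric named ... already exists".
    public void submitGauge(String metricName, long value) {
        metricRegistry.remove(metricName);
        metricRegistry.register(metricName, (Gauge<Long>) () -> value);
    }
}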
But I'm still wondering what makes DropwizardMetricServices#submit() add only new metrics to JMX and not re-add a metric that has already been removed (from JMX). Does DropwizardMetricServices cache (in memory) all the metrics submitted to JMX, so that submit() does not resubmit the metric?
For my current project I'm using Cassandra DB and fetching data frequently. At least 30 DB requests hit every second, and each request needs to fetch at least 40,000 rows from the DB. The following is my current code; this method returns a HashMap.
public Map<String, String> loadObject(ArrayList<Integer> tradigAccountList) {
    com.datastax.driver.core.Session session;
    Map<String, String> orderListMap = new HashMap<>();
    List<ResultSetFuture> futures = new ArrayList<>();
    List<ListenableFuture<ResultSet>> orderedFutures;
    try {
        session = jdbcUtils.getCassandraSession();
        PreparedStatement statement = jdbcUtils.getCassandraPS(CassandraPS.LOAD_ORDER_LIST);
        for (Integer tradingAccount : tradigAccountList) {
            futures.add(session.executeAsync(statement.bind(tradingAccount).setFetchSize(3000)));
        }
        orderedFutures = Futures.inCompletionOrder(futures);
        for (ListenableFuture<ResultSet> future : orderedFutures) {
            for (Row row : future.get()) {
                orderListMap.put(row.getString("cliordid"), row.getString("ordermsg"));
            }
        }
    } catch (Exception e) {
        // TODO: at least log the exception instead of swallowing it
    }
    return orderListMap;
}
My data request query is something like this:
"SELECT cliordid, ordermsg FROM omsks_v1.ordersStringV1 WHERE tradacntid = ?".
My Cassandra cluster has 2 nodes, each with 32 concurrent read and write threads, and my DB schema is as follows:
CREATE TABLE omsks_v1.ordersstringv1_copy1 (
tradacntid int,
cliordid text,
ordermsg text,
PRIMARY KEY (tradacntid, cliordid)
) WITH bloom_filter_fp_chance = 0.01
AND comment = ''
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'sstable_compression' : 'LZ4Compressor'
}
AND compaction = {
'class' : 'SizeTieredCompactionStrategy'
};
My problem is that I'm getting a Cassandra timeout exception. How can I optimize my code to handle all these requests?
It would be better if you attached a snippet of that exception (read/write exception). I assume you are getting a read timeout: you are trying to fetch a large data set in a single request.
For each request at least 40,000 rows need to be fetched from the DB
If you have large records and the result set is too big, Cassandra throws an exception when the results cannot be returned within the time limit configured in cassandra.yaml:
read_request_timeout_in_ms
You can increase the timeout, but this is not a good option. It may resolve the issue (it may no longer throw the exception, but it will take more time to return the result).
Solution: for a big data set you can get the result using manual paging on the clustering key with a LIMIT (a sketch follows the queries below). Note that tradacntid is the partition key, so it must be matched with equality:
SELECT cliordid, ordermsg FROM omsks_v1.ordersStringV1
WHERE tradacntid = ? AND cliordid > ? LIMIT ?;
Or use a range query on the clustering key:
SELECT cliordid, ordermsg FROM omsks_v1.ordersStringV1 WHERE tradacntid = ? AND cliordid >= ? AND cliordid <= ?;
This will be much faster than fetching the whole result set at once.
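As a rough sketch of the LIMIT-based paging above, reusing the session/statement style from the question (CassandraPS.LOAD_ORDER_PAGE and the page size of 5000 are my own placeholder assumptions):

// Hypothetical prepared statement:
// SELECT cliordid, ordermsg FROM omsks_v1.ordersStringV1
// WHERE tradacntid = ? AND cliordid > ? LIMIT ?
PreparedStatement pageStmt = jdbcUtils.getCassandraPS(CassandraPS.LOAD_ORDER_PAGE);

int pageSize = 5000;        // arbitrary page size, tune for your cluster
String lastCliOrdId = "";   // empty string sorts before any real cliordid value
while (true) {
    ResultSet page = session.execute(pageStmt.bind(tradingAccount, lastCliOrdId, pageSize));
    int rowsInPage = 0;
    for (Row row : page) {
        lastCliOrdId = row.getString("cliordid");
        orderListMap.put(lastCliOrdId, row.getString("ordermsg"));
        rowsInPage++;
    }
    if (rowsInPage < pageSize) {
        break;              // last (possibly partial) page reached
    }
}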
You can also try reducing the fetch size to check whether the exception is still thrown (the whole result set will still be returned eventually):
public Statement setFetchSize(int fetchSize)
setFetchSize controls the page size, but it doesn't control the maximum number of rows returned in a ResultSet.
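A minimal sketch of simply lowering the fetch size on the bound statement from the question and letting the driver's automatic paging fetch the rows page by page (the value 500 is just an illustrative assumption):

// Smaller pages keep each round trip well under the read timeout;
// the driver transparently fetches the next page as you iterate.
ResultSet rs = session.execute(statement.bind(tradingAccount).setFetchSize(500));
for (Row row : rs) {
    orderListMap.put(row.getString("cliordid"), row.getString("ordermsg"));
}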
Another point to note: what's the size of tradigAccountList?
Too many requests at a time may also lead to a timeout. If tradigAccountList is large, a lot of read requests are issued at once (load balancing of requests is handled by Cassandra, and how many requests can be handled depends on cluster size and some other factors), which may cause this exception.
Some related links:
Cassandra read timeout
NoHostAvailableException With Cassandra & DataStax Java Driver If Large ResultSet
Cassandra .setFetchSize() on statement is not honoured
I'm collecting a bunch of sensor data in a Service, storing it in a SQL table, and when the user clicks a button I take all of that SQL data and save it to a CSV file, but I keep getting "Window is full: requested allocation XXX" errors in logcat.
From a bit of googling I think this might be due to high RAM usage on my Nexus 5X?
When the user clicks the save button, the code to begin the process looks like this:
File subjectFile = new File(subjectDataDir, subNum + ".csv");
try{
dbHelper.exportSubjectData(subjectFile, subNum);
} catch (SQLException | IOException e){
mainActivity.logger.e(getActivity(), TAG, "exportSubjectData error", e);
}
Then in my DBHelper, the exportSubjectData method looks like this:
public void exportSubjectData(File outputFile, String subNum) throws IOException, SQLException {
csvWrite = new CSVWriter(new FileWriter(outputFile));
curCSV = db.rawQuery("SELECT * FROM " + DATA_TABLE_NAME + " WHERE id = " + subNum, null);
csvWrite.writeNext(curCSV.getColumnNames());
while (curCSV.moveToNext()) {
String arrStr[] = {curCSV.getString(0), curCSV.getString(1), curCSV.getString(2),
curCSV.getString(3), curCSV.getString(4), curCSV.getString(5),
curCSV.getString(6), curCSV.getString(7), curCSV.getString(8),
curCSV.getString(9), curCSV.getString(10)};
csvWrite.writeNext(arrStr);
}
csvWrite.close();
curCSV.close();
}
Firstly, is this type of problem normally caused by RAM usage?
Assuming that my problem is high RAM usage in that section of code, is there a more efficient way to do this without consuming so much memory? The table it's trying to write to CSV has over 300,000 rows and 10 columns.
I guess you are using opencsv. What you can try is calling csvWrite.flush() after every x calls of csvWrite.writeNext(arrStr). That should push the buffered data from memory out to disk.
You have to experiment to find the best value for x.
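A sketch of how that could look applied to the exportSubjectData method from the question; the flush-every-1000-rows interval and the parameterized query are my own assumptions to illustrate the idea:

public void exportSubjectData(File outputFile, String subNum) throws IOException, SQLException {
    CSVWriter csvWrite = new CSVWriter(new FileWriter(outputFile));
    Cursor curCSV = db.rawQuery(
            "SELECT * FROM " + DATA_TABLE_NAME + " WHERE id = ?", new String[]{subNum});
    try {
        csvWrite.writeNext(curCSV.getColumnNames());
        int rowCount = 0;
        while (curCSV.moveToNext()) {
            String[] arrStr = new String[curCSV.getColumnCount()];
            for (int i = 0; i < arrStr.length; i++) {
                arrStr[i] = curCSV.getString(i);
            }
            csvWrite.writeNext(arrStr);
            // Periodically push buffered rows out to the file so they don't
            // pile up in memory; 1000 is an arbitrary interval to tune.
            if (++rowCount % 1000 == 0) {
                csvWrite.flush();
            }
        }
    } finally {
        csvWrite.close();
        curCSV.close();
    }
}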
Hi, I need to get details about the operating system's physical memory, CPU usage, and other such details. I cannot pay for already available APIs; I can use any free API or write my own.
I need all the details shown in the image below.
In the above picture I have to get the following values
Total
Cached
Available
Free
I need all the values like these.
For this I have searched a lot and got some hints. I got the first value, total physical memory, using the code below.
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class MBeanServerDemo {

    public MBeanServerDemo() {
        super();
    }

    public static void main(String... a) throws Exception {
        MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
        Object attribute =
            mBeanServer.getAttribute(new ObjectName("java.lang", "type", "OperatingSystem"), "TotalPhysicalMemorySize");
        long l = Long.parseLong(attribute.toString());
        System.out.println("Total memory: " + (l / (1024 * 1024)));
    }
}
The below is the output of the above program:
Total memory: 3293
Please help me: how do I get the rest of the values?
Edit: I have searched a lot on Google and found many posts on stackoverflow.com, but they all discuss only the memory details. I need all the details, including kernel (paged and non-paged) memory, etc. Please refer to this post...
My Requirement
Thanks a lot.
Can you please look at the API below:
SIGAR API - System Information Gatherer And Reporter
https://support.hyperic.com/display/SIGAR/Home
Some example usages: http://www.programcreek.com/java-api-examples/index.php?api=org.hyperic.sigar.Sigar
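As a rough sketch of what reading memory and CPU figures with SIGAR might look like (assuming the org.hyperic.sigar classes and the native SIGAR library are on the path):

import org.hyperic.sigar.CpuPerc;
import org.hyperic.sigar.Mem;
import org.hyperic.sigar.Sigar;
import org.hyperic.sigar.SigarException;

public class SigarDemo {

    public static void main(String[] args) throws SigarException {
        Sigar sigar = new Sigar();

        // Physical memory figures (values are in bytes)
        Mem mem = sigar.getMem();
        System.out.println("Total MB: " + mem.getTotal() / (1024 * 1024));
        System.out.println("Used  MB: " + mem.getUsed() / (1024 * 1024));
        System.out.println("Free  MB: " + mem.getFree() / (1024 * 1024));

        // Overall CPU usage as a fraction between 0.0 and 1.0
        CpuPerc cpu = sigar.getCpuPerc();
        System.out.println("CPU used: " + CpuPerc.format(cpu.getCombined()));
    }
}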
You can use JNA, which offers access to a lot of platform-specific APIs, such as the Win32 API:
JNA on GitHub
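As a hedged sketch, assuming the jna and jna-platform artifacts (whose Kernel32/WinBase bindings expose GlobalMemoryStatusEx), querying physical memory on Windows could look roughly like this:

import com.sun.jna.platform.win32.Kernel32;
import com.sun.jna.platform.win32.WinBase.MEMORYSTATUSEX;

public class JnaMemoryDemo {

    public static void main(String[] args) {
        // The Win32 GlobalMemoryStatusEx call fills a MEMORYSTATUSEX structure
        MEMORYSTATUSEX status = new MEMORYSTATUSEX();
        if (Kernel32.INSTANCE.GlobalMemoryStatusEx(status)) {
            System.out.println("Total physical MB: " + status.ullTotalPhys.longValue() / (1024 * 1024));
            System.out.println("Available MB: " + status.ullAvailPhys.longValue() / (1024 * 1024));
        }
    }
}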
Consider using jInterop for this task on Windows.
To get the total amount of RAM in MB:
public int getRAMSizeMB() throws JIException
{
String query = "Select * From Win32_ComputerSystem";
long size = 0;
Object[] params = new Object[]
{
new JIString(query),
JIVariant.OPTIONAL_PARAM()
};
JIVariant[] res = super.callMethodA("ExecQuery", params);
JIVariant[][] resSet = Utils.enumToJIVariantArray(res);
for (JIVariant[] resSet1 : resSet)
{
IJIDispatch ramVal = (IJIDispatch) JIObjectFactory.narrowObject(resSet1[0].getObjectAsComObject()
.queryInterface(IJIDispatch.IID));
size = ramVal.get("TotalPhysicalMemory").getObjectAsLong();
break;
}
    return Math.round((size / 1024) / 1024);
}
To get the number of logical processors:
public int getCpuCount() throws JIException
{
String query = "Select NumberOfLogicalProcessors From Win32_Processor";
Object[] params = new Object[]
{
new JIString(query),
JIVariant.OPTIONAL_PARAM()
};
JIVariant[] res = super.callMethodA("ExecQuery", params);
JIVariant[][] resSet = Utils.enumToJIVariantArray(res);
for (JIVariant[] resSet1 : resSet)
{
IJIDispatch procVal = (IJIDispatch) JIObjectFactory.narrowObject(resSet1[0].getObjectAsComObject()
.queryInterface(IJIDispatch.IID));
return procVal.get("NumberOfLogicalProcessors").getObjectAsInt();
}
return -1;
}
Using these functions as a template, you can look up the other properties/functions via the MSDN website to query other values.
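For example, a hedged sketch that follows the same template to read free physical memory from the Win32_OperatingSystem WMI class (the surrounding class, super.callMethodA and Utils.enumToJIVariantArray are assumed to be the same helpers used above; depending on how WMI marshals the uint64 value you may need to read it as a string variant instead of a long):

public long getFreeRAMSizeKB() throws JIException
{
    // WMI reports FreePhysicalMemory in kilobytes
    String query = "Select FreePhysicalMemory From Win32_OperatingSystem";
    Object[] params = new Object[]
    {
        new JIString(query),
        JIVariant.OPTIONAL_PARAM()
    };
    JIVariant[] res = super.callMethodA("ExecQuery", params);
    JIVariant[][] resSet = Utils.enumToJIVariantArray(res);
    for (JIVariant[] resSet1 : resSet)
    {
        IJIDispatch osVal = (IJIDispatch) JIObjectFactory.narrowObject(resSet1[0].getObjectAsComObject()
                .queryInterface(IJIDispatch.IID));
        return osVal.get("FreePhysicalMemory").getObjectAsLong();
    }
    return -1;
}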
The json2sstable tool supplied with Cassandra 1.2.15 fails with an out-of-memory error. Back in 2011 a similar issue was reported as a bug and fixed: https://issues.apache.org/jira/browse/CASSANDRA-2189
Either I am missing some steps in the tool configuration/usage or the bug has re-emerged. Please point out what I am missing.
Repro steps:
1) Cassandra 1.2.15, one table with varchar key and one varchar column filled with random uuids, 6x10^6 records.
2) JSON file generated with sstable2json tool (~1G).
3) Cassandra restarted with new configuration (new data/cache/commit dirs, new partitioner)
4) Keyspace re-created
5) json2sstable fails after several minutes of processing:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at org.codehaus.jackson.util.TextBuffer.contentsAsString(TextBuffer.java:350)
at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:278)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:59)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapArray(UntypedObjectDeserializer.java:165)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:51)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapArray(UntypedObjectDeserializer.java:165)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:51)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:204)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.deserialize(ObjectArrayDeserializer.java:104)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.deserialize(ObjectArrayDeserializer.java:18)
at org.codehaus.jackson.map.ObjectMapper._readValue(ObjectMapper.java:2695)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1294)
at org.codehaus.jackson.JsonParser.readValueAs(JsonParser.java:1368)
at org.apache.cassandra.tools.SSTableImport.importUnsorted(SSTableImport.java:344)
at org.apache.cassandra.tools.SSTableImport.importJson(SSTableImport.java:328)
at org.apache.cassandra.tools.SSTableImport.main(SSTableImport.java:547)
From the json2sstable source code, the tool loads all the records from the JSON file into memory and sorts them by key:
private int importUnsorted(String jsonFile, ColumnFamily columnFamily, String ssTablePath, IPartitioner<?> partitioner) throws IOException
{
int importedKeys = 0;
long start = System.currentTimeMillis();
JsonParser parser = getParser(jsonFile);
Object[] data = parser.readValueAs(new TypeReference<Object[]>(){});
keyCountToImport = (keyCountToImport == null) ? data.length : keyCountToImport;
SSTableWriter writer = new SSTableWriter(ssTablePath, keyCountToImport);
System.out.printf("Importing %s keys...%n", keyCountToImport);
// sort by dk representation, but hold onto the hex version
SortedMap<DecoratedKey,Map<?, ?>> decoratedKeys = new TreeMap<DecoratedKey,Map<?, ?>>();
for (Object row : data)
{
Map<?,?> rowAsMap = (Map<?, ?>)row;
decoratedKeys.put(partitioner.decorateKey( hexToBytes((String)rowAsMap.get("key"))), rowAsMap);
....
According to Jonathan Ellis' comment on the CASSANDRA-2322 issue, this behavior is by design.
Thus json2sstable is not well suited for importing production-size data into Cassandra; the tool is likely to crash on large datasets.