Parsing an XML file, exponential increase in time - java

I have a parser which parses an XML file, collects the required fields, and constructs an object out of them.
Suppose the XML looks like this:
<xml>
<p1>
...
...
</p1>
<p2>
...
</p2>
...
...
</xml>
My Java code parses it; in pseudocode it looks like this:
for each product // p1, p2, etc.
    print start time
    parse that node, which returns an object
    print end time
    add the object to the list
The sample code is below:
products = (NodeList) xPath.evaluate("/xml/product", pxml, XPathConstants.NODESET);
for (int i = 0; i < products.getLength(); i++)
{
    System.out.println("parsing product ::" + i + ":" + (System.currentTimeMillis() - time));
    BookDataInfo _parsedPoduct = ParseProduct(products.item(i));
    System.out.println("parsing product finished ::" + i + ":" + (System.currentTimeMillis() - time));
    if (_parsedPoduct.getParsingSucceeded())
    {
        pparsedProducts.add(_parsedPoduct);
    }
}
I have printed the times before and after parsing each node, and the time keeps increasing with the number of products: the 1st product takes about 100 ms, whereas around the 300th product it takes about 2000 ms.
In each case the same code path is executed for parsing.
Does anyone have an idea why this happens?
I can't post the code of what ParseProduct is doing, but I found out where most of the time is consumed:
private NodeList getNodelist(Node xml, String Name)
{
    long time = System.currentTimeMillis();
    System.out.println("Nodelist start::" + (System.currentTimeMillis() - time));
    NodeList nodes = (NodeList) xPath.evaluate(Name, xml, XPathConstants.NODESET);
    System.out.println("Nodelist end::" + (System.currentTimeMillis() - time));
    return nodes;
}
Similarly, for getting a node value there is a statement like:
Node node = (Node)xPath.evaluate(Name,xml,XPathConstants.NODE);
Here xPath is a static object of type XPath.
When the above function is called multiple times for a product, the later calls take much longer: at the start a call took 2-3 ms, but later (say, around product 300) it took 55-60 ms per call.
Am I missing something here?
Thanks!

Check out the difference between DOM and SAX parsing. DOM lets you query the XML document, but it has to load the entire document into memory to do so. If you just want to create objects, you are better off using a SAX parser, which streams the document and never builds the full tree.
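A minimal sketch of the SAX approach, assuming the repeated element is named product (an assumption based on the question's XPath); the handler below just collects raw text per product, and you would construct your BookDataInfo there instead:

import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ProductHandler extends DefaultHandler {

    final List<String> products = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <product>

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("product".equals(qName)) {
            text = new StringBuilder(); // start collecting this product's text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("product".equals(qName)) {
            products.add(text.toString().trim()); // build the real object here instead
            text = null;
        }
    }

    public static void main(String[] args) throws Exception {
        ProductHandler handler = new ProductHandler();
        SAXParserFactory.newInstance().newSAXParser().parse("products.xml", handler);
        System.out.println(handler.products.size() + " products parsed");
    }
}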

The problem is solved.
The main issue is the one mentioned in the link below:
XPath.evaluate performance slows down (absurdly) over multiple calls
Following the steps mentioned there drastically reduced the time consumed.
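For anyone landing here later: one workaround discussed in that thread is to copy each product node into its own small document before evaluating relative XPath expressions against it, so the evaluator's internal DTM cache does not keep re-scanning the ever-growing source document. A rough sketch; the helper name is mine, and whether this matches the exact step applied here is an assumption:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Deep-copy a product node into a fresh, tiny document so that
// xPath.evaluate(...) against it only ever sees that one product.
private static Node isolate(Node productNode) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
    doc.appendChild(doc.importNode(productNode, true));
    return doc.getDocumentElement();
}

// Usage in the original loop:
// BookDataInfo _parsedPoduct = ParseProduct(isolate(products.item(i)));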

Related

Check for substring efficiently for large data sets

I have:
a database table with 400 000 000 rows (Cassandra 3)
a list of circa 10 000 keywords
both data sets are expected to grow in time
I need to:
check whether a specified column contains a keyword
count how many rows contain the keyword in that column
Which approach should I choose?
Approach 1 (Secondary index):
Create a secondary SASI index on the table
Find matches for a given keyword "on the fly" at any time
However, I am afraid of:
capacity problems - secondary indices consume extra space, and for such a large table it could be too much
performance - I am not sure whether finding a keyword among hundreds of millions of rows can be achieved in a reasonable time
Approach 2 (Java job - brute force):
A Java job continuously iterates over the data
Matches are saved into a cache
The cache is updated during the next iteration
// Paginate through data...
String page = null;
do {
    PagingState state = page == null ? null : PagingState.fromString(page);
    PagedResult<DataRow> res = getDataPaged(query, status, PAGE_SIZE, state);
    // Iterate through the current page...
    for (DataRow row : res.getResult()) {
        // Skip empty titles
        if (row.getTitle().length() == 0) {
            continue;
        }
        // Find match in title
        for (String k : keywords) {
            if (k.length() > row.getTitle().length()) {
                continue;
            }
            if (row.getTitle().toLowerCase().contains(k.toLowerCase())) {
                // TODO: save match
                break;
            }
        }
    }
    status = res.getResult();
    page = res.getPage();
    // TODO: wait here to reduce DB load
} while (page != null);
Problems:
It could be very slow to iterate through the whole table. If I waited one second per every 1000 rows, this cycle would finish in 4.6 days.
This would require extra space for the cache; moreover, frequent deletions from the cache would produce tombstones in Cassandra.
A better way would be to use a search engine like Solr or Elasticsearch; full-text search is their specialty. You could easily dump your data from Cassandra into Elasticsearch and implement your Java job on top of Elasticsearch.
EDIT:
Cassandra can return query results as JSON, and Elasticsearch speaks only JSON, so you will be able to transfer your data very easily.
Elasticsearch
Solr
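As a rough illustration of the transfer step, here is a minimal sketch that PUTs one JSON row into a local Elasticsearch instance over its REST API using the JDK 11 HttpClient; the index name rows, the document id, and the payload are illustrative assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexRow {
    public static void main(String[] args) throws Exception {
        // One row dumped from Cassandra as JSON (e.g. via SELECT JSON ...)
        String json = "{\"title\": \"some row title from cassandra\"}";
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/rows/_doc/1"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> res = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(res.statusCode() + " " + res.body());
    }
}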

Android : Retrieve multiple Elements from HTML using JSoup

I want to retrieve a title from a div, plus a start hour and an end hour, all from a big div with class day, each inside another div with class event.
I need to add these items to a list, but right now I'm stuck because it can't retrieve my 3 elements.
Document doc = Jsoup.connect("http://terry.gonguet.com/cal/?g=tp11").get();
Elements days = doc.select("div[class=day]");
Elements event = doc.select("div[class=event]");
for (Element day : days)
{
    System.out.println(" : " + day.text());
    for (Element ev : event)
    {
        Element title = ev.select("div[class=title]").first();
        Element starthour = ev.select("div[class=bub right top]").first();
        Element endhour = ev.select("div[class=bub right bottom]").first();
        System.out.println(title.text() + starthour.text() + endhour.text());
    }
}
There is no div in that document whose class attribute is exactly day. They all have the day class combined with another class, which prevents div[class=day] from matching any div. The same problem applies to the div[class=event] selector.
To solve it, use CSS query syntax, in which the . operator matches on the class attribute
(hint: to select an element which has several classes, you can chain them as element.class1.class2).
So instead of
select("div[class=day]");
select("div[class=event]");
use
select("div.day");
select("div.event");
Also instead of
ev.select("div[class=bub right top]");
ev.select("div[class=bub right bottom]");
you could try using
ev.select("div.bub.right.top");
ev.select("div.bub.right.bottom]");
This will allow you to find a div which has all these classes (even if they are not in the same order, or there are more classes than mentioned in the selector).
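Putting it together, a sketch of the corrected loop; note that it also selects events within each day element rather than globally, which is an assumption about what the page structure intends:

Document doc = Jsoup.connect("http://terry.gonguet.com/cal/?g=tp11").get();
for (Element day : doc.select("div.day"))
{
    System.out.println(" : " + day.text());
    // Scope the event search to this day instead of the whole document
    for (Element ev : day.select("div.event"))
    {
        Element title = ev.select("div.title").first();
        Element starthour = ev.select("div.bub.right.top").first();
        Element endhour = ev.select("div.bub.right.bottom").first();
        // Guard against events missing one of the three parts
        if (title != null && starthour != null && endhour != null)
        {
            System.out.println(title.text() + " " + starthour.text() + " " + endhour.text());
        }
    }
}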

XPath Contains Query on child node to return Parent node

String query = null; // search term, set elsewhere
XPathExpression expr;
Object result = null;
expr = xpath.compile("//table/column[contains(translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'" + query + "')]//text()");
result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println(nodes.item(i).getParentNode().getNodeName() + " " + nodes.item(i).getNodeValue());
}
Hi, so I want to start off by saying I am new to XPath and fairly new to Java. I am trying to create a query interface for this large XML file, and this is what I have come up with so far. The XML file is full of logs and it is set up somewhat like this:
<database>
<table>
<column1>
<column2>
<column3>
.....
The code works well at pulling back the columns that match the search term; however, I would like it to pull back the whole table and then print it out. This would give me more valuable info, including the date stamp, the person who entered it, etc. I have tried various things, from getting the parent node of nodes.item(i) and putting its getChildNodes() into another NodeList, but that didn't work at all. I also tried adding /.. at the end of the XPath before the text() to see if it would give me back the parent, but that ended up just giving me the root tag somehow. I think I am kind of close, maybe not, I don't know, but if anyone can help that would be much appreciated. I have been stuck on this for a while now.
I think you want to use .. instead of the trailing //text().
For example, to get every table with a td value of 'xxx' you could use:
//table/tr/td[text() = 'xxx']/../..
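Applied to the schema in the question, something like the following should hand back the enclosing table elements directly; the wildcard * is used because the column elements carry numbered names (column1, column2, ...), which is an assumption from the sample:

// Walk back up from the matching column to its enclosing <table>
// by replacing the trailing //text() with /.. on the column element.
XPathExpression expr = xpath.compile(
        "//table/*[contains(translate(text()," +
        "'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'" + query + "')]/..");
NodeList tables = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < tables.getLength(); i++) {
    // Each hit is a whole <table>; print all of its child columns
    NodeList columns = tables.item(i).getChildNodes();
    for (int j = 0; j < columns.getLength(); j++) {
        System.out.println(columns.item(j).getNodeName() + " " + columns.item(j).getTextContent());
    }
}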

elasticsearch: Cannot find indexed data (until Node is closed)

I'm trying to start using elasticsearch (having been a long-term Compass user) and I'm having some pretty serious problems with the basics, which is highly frustrating.
The current problem I'm facing is that indexed data does not show up until after the node is closed. Here is a sample of my code:
Node node = nodeBuilder().node();
Client client = node.client();
client.prepareIndex("index1", "type1", "1").setSource("{ \"name\": \"Aaron\"}").execute().actionGet();
client.prepareIndex("index1", "type1", "2").setSource("{ \"name\": \"Andrew\"}").execute().actionGet();
client.prepareIndex("index1", "type1", "3").setSource("{ \"name\": \"Alistair\"}").execute().actionGet();
QueryBuilder queryBuilder = QueryBuilders.wildcardQuery("name", "a*");
SearchRequestBuilder searchRequestBuilder = client.prepareSearch("index1");
searchRequestBuilder.setTypes("type1");
searchRequestBuilder.setSearchType(SearchType.DEFAULT);
searchRequestBuilder.setQuery(queryBuilder);
SearchResponse response = searchRequestBuilder.execute().actionGet();
System.out.println("Response contains " + response.getHits().totalHits() + " hits");
for (SearchHit currentHit : response.getHits())
{
    System.out.println(currentHit.getSourceAsString());
}
client.close();
node.close();
The first time I run this, it finds no hits in the search. However, if I run it again, it does indeed find the names that all begin with the letter "A" (don't get me started on the auto-lowercasing of indexed items but not of searches; that cost me over an hour).
If I remove the close calls, it doesn't matter how many times I run the above, I never find results. However, if I add the close statements, it works the second time (every time).
It feels like something to do with buffered index changes that aren't flushed?
I am sure I am missing something obvious and basic. But I just cannot put my finger on it.
You need to refresh the index before you can search for the latest changes. Put this after the indexing and before executing the search:
client.admin().indices().prepareRefresh("index1").execute().actionGet();
With the default settings, Elasticsearch refreshes each index automatically every second.
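In the snippet from the question, the refresh call slots in between the last prepareIndex and the search, roughly like this:

// ... the three prepareIndex(...).execute().actionGet() calls as above ...
// Force a refresh so the just-indexed documents become visible to search:
client.admin().indices().prepareRefresh("index1").execute().actionGet();
// Now the wildcard query finds all three names on the first run:
SearchResponse response = client.prepareSearch("index1")
        .setTypes("type1")
        .setQuery(QueryBuilders.wildcardQuery("name", "a*"))
        .execute().actionGet();
System.out.println("Response contains " + response.getHits().totalHits() + " hits");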

Java heap space errors using bigger amounts of data in neo4j

I am currently evaluating neo4j with respect to inserting big amounts of nodes/relationships into the graph. It is not about initial inserts, which could be achieved with batch inserts; it is about inserts that are processed frequently during runtime in a Java application that uses neo4j in embedded mode (currently version 1.8.1, as shipped with spring-data-neo4j 2.2.2.RELEASE).
These inserts usually follow the star schema: one single node (the root node of the imported dataset) has up to 1000000 (one million!) connected child nodes. The child nodes normally have relationships to other additional nodes, too, but those relationships are not covered by this test so far. The overall goal is to import that amount of data in at most five minutes!
To simulate this kind of insert I wrote a small JUnit test that uses the Neo4jTemplate for creating the nodes and relationships. Each inserted leaf has a key associated for later processing:
@Test
@Transactional
@Rollback
public void generateUngroupedNode()
{
    long numberOfLeafs = 1000000;
    Assert.assertTrue(this.template.transactionIsRunning());
    Node root = this.template.createNode(map(NAME, UNGROUPED));
    String groupingKey = null;
    for (long index = 0; index < numberOfLeafs; index++)
    {
        // Just a sample division of leafs to possible groups
        // Creates keys to be grouped by to groups containing 2 leafs each
        if (index % 2 == 0)
        {
            groupingKey = UUID.randomUUID().toString();
        }
        Node leaf = this.template.createNode(map(GROUPING_KEY, groupingKey, NAME, LEAF));
        this.template.createRelationshipBetween(root, leaf, Relationships.LEAF.name(), map());
    }
}
For this test I use the gcr cache to avoid Garbage Collector issues:
cache_type=gcr
node_cache_array_fraction=7
relationship_cache_array_fraction=5
node_cache_size=400M
relationship_cache_size=200M
Additionally I set my MAVEN_OPTS to:
export MAVEN_OPTS="-Xmx4096m -Xms2046m -XX:PermSize=256m -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit"
But whenever I run that test, I get a Java heap space error:
java.lang.OutOfMemoryError: Java heap space
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
at java.lang.Class.getMethod0(Class.java:2670)
at java.lang.Class.getMethod(Class.java:1603)
at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:896)
at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:862)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:859)
at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:423)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
at org.springframework.transaction.support.TransactionTemplate.<init>(TransactionTemplate.java:67)
at org.springframework.data.neo4j.support.Neo4jTemplate.exec(Neo4jTemplate.java:403)
at org.springframework.data.neo4j.support.Neo4jTemplate.createRelationshipBetween(Neo4jTemplate.java:367)
I did some tests with smaller amounts of data, with the following outcomes for one node connected to:
50000 leafs: 3035ms
100000 leafs: 4290ms
200000 leafs: 10268ms
400000 leafs: 20913ms
800000 leafs: Java heap space
Here is a screenshot of the system monitor during those operations (not reproduced here).
To get a better impression of what exactly is running and stored on the heap, I ran JProfiler against the last test (800000 leafs); screenshots of heap usage and CPU usage are likewise omitted.
The big question for me is: is neo4j not designed for this kind of huge amount of data? Or are there other ways to achieve this kind of insert (and later operations)? On the official neo4j website and in various screencasts I found the information that neo4j is able to run with billions of nodes and relationships (e.g. http://docs.neo4j.org/chunked/stable/capabilities-capacity.html). I didn't find any functionality like the flush() and clear() methods that are available e.g. in JPA to keep the heap clean manually.
It would be great to be able to use neo4j with those amounts of data. Already with 200000 leafs stored in the graph I noticed a performance improvement of a factor of 10 and more compared to an embedded classic RDBMS. I don't want to give up the nice way of modeling and querying data that neo4j provides.
By just using the Neo4j core API, it takes between 18 and 26 seconds to create the children on my MacBook Air, without any optimizations:
Output: import of 1000000 children took 26 seconds.
public class CreateManyRelationships {

    public static final int COUNT = 1000 * 1000;
    public static final DynamicRelationshipType CHILD = DynamicRelationshipType.withName("CHILD");
    public static final File DIRECTORY = new File("target/test.db");

    public static void main(String[] args) throws IOException {
        FileUtils.deleteRecursively(DIRECTORY);
        GraphDatabaseService gdb = new GraphDatabaseFactory().newEmbeddedDatabase(DIRECTORY.getAbsolutePath());
        long time = System.currentTimeMillis();
        Transaction tx = gdb.beginTx();
        Node root = gdb.createNode();
        for (int i = 1; i <= COUNT; i++) {
            Node child = gdb.createNode();
            root.createRelationshipTo(child, CHILD);
            if (i % 50000 == 0) {
                tx.success(); tx.finish();
                tx = gdb.beginTx();
            }
        }
        tx.success(); tx.finish();
        time = System.currentTimeMillis() - time;
        System.out.println("import of " + COUNT + " children took " + time / 1000 + " seconds.");
        gdb.shutdown();
    }
}
And the Spring Data Neo4j docs state that it is not made for this type of task.
If you are connecting 800K child nodes to one node, you are effectively creating a dense node, a.k.a. a key-value-like structure. Neo4j right now is not optimized to handle these structures effectively, as all connected relationships are loaded into memory upon traversal of a node. This will be addressed by Neo4j 2.1 with configurable optimizations, if you only want to load part of the relationships when touching these structures.
For the time being, I would recommend either putting these structures into indexes instead and doing a lookup for the connected nodes, or balancing the dense structure along one value (e.g. build a subtree with, say, 100 subcategories along one of the properties on the relationships, e.g. time; see http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-path-tree.html for instance). A sketch of that balancing idea follows below.
Would that help?
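A minimal sketch of the balancing idea, reusing gdb, root, CHILD and COUNT from the listing above; the BUCKET relationship type and the fan-out of 1000 are assumptions:

// Insert an intermediate layer of bucket nodes so no single node ends up
// with a million relationships: root -> 1000 buckets -> ~1000 leaves each.
DynamicRelationshipType BUCKET = DynamicRelationshipType.withName("BUCKET");
int fanOut = 1000;
Node[] buckets = new Node[fanOut];
Transaction tx = gdb.beginTx();
for (int b = 0; b < fanOut; b++) {
    buckets[b] = gdb.createNode();
    root.createRelationshipTo(buckets[b], BUCKET);
}
for (int i = 1; i <= COUNT; i++) {
    Node child = gdb.createNode();
    buckets[i % fanOut].createRelationshipTo(child, CHILD);
    if (i % 50000 == 0) { // batch the commits, as in the listing above
        tx.success(); tx.finish();
        tx = gdb.beginTx();
    }
}
tx.success(); tx.finish();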
