How to properly iterate a BigQuery TableResult in Java

I am trying to iterate the rows from a TableResult using getValues() as below.
If I use getValues(), it only retrieves the first page of rows. I want to iterate all the rows using getValues() and NOT using iterateAll().
In the code below, the problem is that it runs forever: while (results.hasNextPage()) never ends. What is wrong with the code below?
{
    query = "select * from aa.bb.cc";
    QueryJobConfiguration queryConfig =
        QueryJobConfiguration.newBuilder(query)
            .setPriority(QueryJobConfiguration.Priority.BATCH)
            .build();
    TableResult results = bigquery.query(queryConfig);
    int i = 0;
    int j = 0;
    while (results.hasNextPage()) {
        j++;
        System.out.println("page " + j);
        System.out.println("Data Extracted::" + i + " records");
        for (FieldValueList row : results.getNextPage().getValues()) {
            i++;
        }
    }
    System.out.println("Total Count::" + results.getTotalRows());
    System.out.println("Data Extracted::" + i + " records");
}
I have only 200,000 records in the source table. Below is the output; I had to stop the process forcefully.
page 1
Data Extracted::0 records
page 2
Data Extracted::85242 records
page 3
Data Extracted::170484 records
page 4
Data Extracted::255726 records
page 5
Data Extracted::340968 records
page 6
Data Extracted::426210 records
page 7
Data Extracted::511452 records
page 8
Data Extracted::596694 records
.......

In short, you need to update the TableResult variable with the page returned by getNextPage(). If you don't update it, you keep looping over the same page again and again; that is why you are getting far more records in your output than exist in the table.
If you check the following samples, BigQuery Pagination and Using the Java Client Library, there are ways to deal with paginated results, although they are not specific to single-run queries.
As shown in the code below, which is partially based on the pagination sample, you need to use the output of getNextPage() to update the results variable and proceed with the next iteration of the while loop, until all pages but the last have been iterated.
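Applied to the loop in your question, the fix looks like this (a sketch that keeps your counters; note that results is reassigned inside the loop):
while (results.hasNextPage()) {
    j++;
    System.out.println("page " + j);
    for (FieldValueList row : results.getValues()) {
        i++;
    }
    results = results.getNextPage(); // advance, otherwise hasNextPage() never changes
}
// the last page still has to be processed after the loop
for (FieldValueList row : results.getValues()) {
    i++;
}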
QueryRun.java
package com.projects;

// [START bigquery_query]
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQuery.QueryResultsOption;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;
import java.util.UUID;

public class QueryRun {

    public static void main(String[] args) {
        String projectId = "bigquery-public-data";
        String datasetName = "covid19_ecdc_eu";
        String tableName = "covid_19_geographic_distribution_worldwide";
        String query =
            "SELECT * "
                + " FROM `"
                + projectId
                + "."
                + datasetName
                + "."
                + tableName
                + "`"
                + " LIMIT 100";
        System.out.println(query);
        query(query);
    }

    public static void query(String query) {
        try {
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
            QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).build();

            // Create a job ID so that we can safely retry.
            JobId jobId = JobId.of(UUID.randomUUID().toString());
            Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());

            // Request small pages so the pagination is easy to observe.
            TableResult results = queryJob.getQueryResults(QueryResultsOption.pageSize(10));

            int i = 0;
            int j = 0;

            // Iterate all pages except the last one,
            // advancing "results" with getNextPage() on every pass.
            while (results.hasNextPage()) {
                j++;
                for (FieldValueList row : results.getValues()) {
                    i++;
                }
                results = results.getNextPage();
                print_msg(i, j);
            }

            // Process the last page.
            j++;
            for (FieldValueList row : results.getValues()) {
                i++;
            }
            print_msg(i, j);

            System.out.println("Query performed successfully.");
        } catch (BigQueryException | InterruptedException e) {
            System.out.println("Query not performed \n" + e.toString());
        }
    }

    public static void print_msg(int i, int j) {
        System.out.println("page " + j);
        System.out.println("Data Extracted::" + i + " records");
    }
}
// [END bigquery_query]
output:
SELECT * FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide` LIMIT 100
page 1
Data Extracted::10 records
page 2
Data Extracted::20 records
page 3
Data Extracted::30 records
page 4
Data Extracted::40 records
page 5
Data Extracted::50 records
page 6
Data Extracted::60 records
page 7
Data Extracted::70 records
page 8
Data Extracted::80 records
page 9
Data Extracted::90 records
page 10
Data Extracted::100 records
Query performed successfully.
As a final note, there is no official sample about pagination for queries, so I'm not totally sure of the recommended way to handle pagination with Java; it is not quite clear on the BigQuery for Java documentation page. If you can update your question with your approach to pagination, I would appreciate it.
If you have issues running the attached sample, please see the Using the BigQuery Java client sample, its GitHub page and the pom.xml file inside it, and check that your setup is in line with it.

I am probably late with this response, but reading the Java client guide (https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries#complete_source_code)
it says:
Iterate over the QueryResponse to get all the rows in the results. The iterator automatically handles pagination. Each FieldList exposes the columns by numeric index or column name.
That said, it should be easier to simply use the iterateAll() method.
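For example, with the same BigQuery client and query configuration as in the question (a minimal sketch; job setup and exception handling are left as in the original code):
TableResult results = bigquery.query(queryConfig);
long count = 0;
// iterateAll() transparently fetches the following pages as the loop advances
for (FieldValueList row : results.iterateAll()) {
    count++;
}
System.out.println("Data Extracted::" + count + " records");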
Let me know if I am wrong.

Related

Is there a possibility to batch a select-query with SPARQL and RDF4J?

I am working with a quite large dataset (around 500 million triples) stored in GraphDB Free, running on my local developer machine.
I want to do some operations on the dataset with RDF4J and have to SELECT more or less the whole dataset. As a test, I just SELECT the desired tuples. The code runs fine for the first million tuples, but after that it gets really slow, since GraphDB keeps allocating more RAM.
Is there a way to run a SELECT query on a very big dataset and get the results in batches?
Basically I just want to "iterate" through some selected triples, so there should be no need for GraphDB to use that much RAM. I can see that I already receive data in RDF4J before the query finishes, since it crashes (HeapSpaceError) only at about 1.4 million read tuples. Unfortunately GraphDB somehow doesn't free the memory of the already-read tuples. Am I missing something?
Thanks a lot for your help.
PS: I already set the usable heap space of GraphDB to 20 GB.
The RDF4J (Java) Code looks like following:
package ch.test;

import org.eclipse.rdf4j.query.*;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RDF2RDF {

    public static void main(String[] args) {
        System.out.println("Running RDF2RDF");

        HTTPRepository sourceRepo = new HTTPRepository("http://localhost:7200/repositories/datatraining");
        try {
            String path = new File("").getAbsolutePath();
            String sparqlCommand = Files.readString(Paths.get(path + "/src/main/resources/sparql/select.sparql"), StandardCharsets.ISO_8859_1);

            int chunkSize = 10000;
            int positionInChunk = 0;
            long loadedTuples = 0;

            RepositoryConnection sourceConnection = sourceRepo.getConnection();
            TupleQuery query = sourceConnection.prepareTupleQuery(sparqlCommand);

            try (TupleQueryResult result = query.evaluate()) {
                for (BindingSet solution : result) {
                    loadedTuples++;
                    positionInChunk++;
                    if (positionInChunk >= chunkSize) {
                        System.out.println("Got " + loadedTuples + " Tuples");
                        positionInChunk = 0;
                    }
                }
            }
        } catch (IOException err) {
            err.printStackTrace();
        }
    }
}
select.sparql:
PREFIX XXX_meta_schema: <http://schema.XXX.ch/meta/>
PREFIX XXX_post_schema: <http://schema.XXX.ch/post/>
PREFIX XXX_post_tech_schema: <http://schema.XXX.ch/post/tech/>
PREFIX XXX_geo_schema: <http://schema.XXX.ch/geo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX XXX_raw_schema: <http://schema.XXX.ch/raw/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
BIND(<http://data.XXX.ch/raw/Table/XXX.csv> as ?table).
?row XXX_raw_schema:isDefinedBy ?table.
?cellStreetAdress XXX_raw_schema:isDefinedBy ?row;
XXX_raw_schema:ofColumn <http://data.XXX.ch/raw/Column/Objektadresse>;
rdf:value ?valueStreetAdress.
?cellOrt mobi_raw_schema:isDefinedBy ?row;
XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/Ort>;
rdf:value ?valueOrt.
?cellPlz mobi_raw_schema:isDefinedBy ?row;
XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/PLZ>;
rdf:value ?valuePLZ.
BIND (URI(concat("http://data.XXX.ch/post/tech/Adress/", MD5(STR(?cellStreetAdress)))) as ?iri_tech_Adress).
}
My solution:
Use a subselect statement which gets all "rows" first.
PREFIX mobi_post_schema: <http://schema.mobi.ch/post/>
PREFIX mobi_post_tech_schema: <http://schema.mobi.ch/post/tech/>
PREFIX mobi_geo_schema: <http://schema.mobi.ch/geo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mobi_raw_schema: <http://schema.mobi.ch/raw/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
{
SELECT ?row WHERE
{
BIND(<http://data.mobi.ch/raw/Table/Gebaeudeobjekte_August2020_ARA_Post.csv> as ?table).
?row mobi_raw_schema:isDefinedBy ?table.
}
}
?cellStreetAdress mobi_raw_schema:isDefinedBy ?row;
mobi_raw_schema:ofColumn <http://data.mobi.ch/raw/Column/Objektadresse>;
rdf:value ?valueStreetAdress.
?cellOrt mobi_raw_schema:isDefinedBy ?row;
mobi_raw_schema:ofColumn <http://data.mobi.ch/raw/Column/Ort>;
rdf:value ?valueOrt.
?cellPlz mobi_raw_schema:isDefinedBy ?row;
mobi_raw_schema:ofColumn <http://data.mobi.ch/raw/Column/PLZ>;
rdf:value ?valuePLZ.
BIND (URI(concat("http://data.mobi.ch/post/tech/Adress/", MD5(STR(?cellStreetAdress)))) as ?iri_tech_Adress).
}
I don't know immediately why the query given would be so costly, memory-wise, for GraphDB Free to execute, but generally a lot can depend on the shape and size of your dataset. Of course, doing a query that basically retrieves the entire database is not necessarily a wise thing to do in the first place.
Having said that, there are a couple of things you can try. Working with LIMIT and OFFSET as a pagination mechanism is one way, as sketched below.
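A minimal sketch of the LIMIT/OFFSET approach with RDF4J, assuming the base SELECT query has a deterministic order (for example an ORDER BY clause) so pages neither overlap nor skip rows; it uses the same RDF4J classes already imported in your code:
// Pages through a SELECT query by appending LIMIT/OFFSET and returns the total row count.
static long pageThroughResults(RepositoryConnection connection, String baseQuery, int pageSize) {
    long total = 0;
    int offset = 0;
    while (true) {
        String paged = baseQuery + " LIMIT " + pageSize + " OFFSET " + offset;
        TupleQuery query = connection.prepareTupleQuery(paged);
        int rowsInPage = 0;
        try (TupleQueryResult result = query.evaluate()) {
            while (result.hasNext()) {
                BindingSet solution = result.next();
                rowsInPage++;
                total++;
                // process the solution here
            }
        }
        if (rowsInPage < pageSize) {
            return total; // last (possibly partial) page reached
        }
        offset += pageSize;
    }
}
Be aware that large OFFSET values can themselves become expensive on some stores, which is why the split-query approach described next may scale better.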
Another option you could try is to split your query in two: one query retrieves all identifiers of resources you're interested in, and then you iterate over those and for each do a separate query to get the details (attributes and relations) for that particular resource.
In your example, you could split on ?row, so you'd first do a query to get all rows for the given table:
SELECT ?row WHERE {
VALUES ?table { <http://data.XXX.ch/raw/Table/XXX.csv> }
?row XXX_raw_schema:isDefinedBy ?table.
}
And then you iterate over that result, injecting each returned value for ?row into the query that retrieves details:
SELECT * WHERE {
VALUES ?row { <http://data.XXX.ch/raw/Table/XXX.csv#row1> }
?cellStreetAdress XXX_raw_schema:isDefinedBy ?row;
XXX_raw_schema:ofColumn <http://data.XXX.ch/raw/Column/Objektadresse>;
rdf:value ?valueStreetAdress.
?cellOrt mobi_raw_schema:isDefinedBy ?row;
XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/Ort>;
rdf:value ?valueOrt.
?cellPlz mobi_raw_schema:isDefinedBy ?row;
XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/PLZ>;
rdf:value ?valuePLZ.
BIND (URI(concat("http://data.XXX.ch/post/tech/Adress/", MD5(STR(?cellStreetAdress)))) as ?iri_tech_Adress).
}
In Java code, something like this:
String sparqlCommand1 = // the query for all rows of the table

// Query for the details of each row. The value of ?row will be injected via the API.
// (Prefix declarations omitted for brevity; prepend the PREFIX lines from the original query.)
String sparqlCommand2 = "SELECT * WHERE { \n"
        + "    ?cellStreetAdress XXX_raw_schema:isDefinedBy ?row;\n"
        + "          XXX_raw_schema:ofColumn <http://data.XXX.ch/raw/Column/Objektadresse>;\n"
        + "          rdf:value ?valueStreetAdress.\n"
        + "    ?cellOrt mobi_raw_schema:isDefinedBy ?row;\n"
        + "          XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/Ort>;\n"
        + "          rdf:value ?valueOrt.\n"
        + "    ?cellPlz mobi_raw_schema:isDefinedBy ?row;\n"
        + "          XXX_raw_schema:ofColumn <http://XXX.mobi.ch/raw/Column/PLZ>;\n"
        + "          rdf:value ?valuePLZ.\n"
        + "    BIND (URI(concat(\"http://data.XXX.ch/post/tech/Adress/\", MD5(STR(?cellStreetAdress)))) as ?iri_tech_Adress).\n"
        + "}";

try (RepositoryConnection sourceConnection = sourceRepo.getConnection()) {
    TupleQuery rowQuery = sourceConnection.prepareTupleQuery(sparqlCommand1);
    TupleQuery detailsQuery = sourceConnection.prepareTupleQuery(sparqlCommand2);

    try (TupleQueryResult result = rowQuery.evaluate()) {
        for (BindingSet solution : result) {
            // inject the current row identifier
            detailsQuery.setBinding("row", solution.getValue("row"));

            // execute the details query for the row and do something with the result
            detailsQuery.evaluate().forEach(System.out::println);
        }
    }
}
You're doing more queries this way of course (N+1 where N is the number of rows), but each individual query result is only a small chunk, and probably easier for GraphDB Free (as well as your own application) to manage.

Camel: import data between two servers

I am integrating two systems and have to insert data from a client table on one server into a table on another server, without any business logic or data modification, once per week. Every time it runs I have to insert all the data. So I wrote the Camel configuration attached below. It works for a small amount of data, but when the client table has more than 20,000 rows I get the exception java.lang.OutOfMemoryError: GC overhead limit exceeded. I tried changing the Java memory settings, e.g. "set JAVA_OPTS=-Dfile.encoding=UTF-8 -Xms2048m -Xmx16384m -XX:PermSize=1012m -XX:MaxPermSize=2048m -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit", but that does not help.
(Screenshots attached in the original post are omitted here.)
I am working on the same kind of tool.
To query and then insert large tables I am using the Camel JDBC StreamList.
In this example I work with two datasources:
dataSource1 for the query and dataSource2 for the insert.
The query result from dataSource1 is processed as a stream and the body is split row by row, so not all data is held in memory at once; each row is written to a file, and a prepared insert statement is built and executed against dataSource2.
private void createRoute(String tableName) {
    this.from("timer://timer1?repeatCount=1") //
        .setBody(this.simple("delete from " + tableName)) //
        .to("jdbc:dataSource2") //
        .setBody(this.simple("select * from " + tableName)) //
        .to("jdbc:dataSource1?outputType=StreamList") //
        // .to("log:stream") //
        .split(this.body()) //
        .streaming() //
        // .to("log:row") //
        .multicast().to("direct:file", "direct:insert") //
        .end().log("End");

    this.from("direct:file")
        // .marshal(jsonFormat) //
        .marshal(new CsvDataFormat().setDelimiter(';')) //
        .to("stream:file?fileName=output/" + tableName + ".txt").log("Data: ${body}");

    this.from("direct:insert").process(new Processor() {
        @Override
        public void process(Exchange exchange) throws Exception {
            // Build an insert statement with named parameters from the row's column map.
            StringBuilder insert = new StringBuilder("insert into ").append(tableName).append(" (");
            StringBuilder values = new StringBuilder("values(");
            LinkedHashMap<?, ?> body = (LinkedHashMap<?, ?>) exchange.getIn().getBody();
            Iterator<String> i = (Iterator<String>) body.keySet().iterator();
            while (i.hasNext()) {
                String key = i.next();
                insert.append(key);
                values.append(":?" + key);
                exchange.getOut().getHeaders().put(key, body.get(key));
                if (i.hasNext()) {
                    insert.append(",");
                    values.append(",");
                } else {
                    insert.append(") ");
                    values.append(") ");
                }
            }
            String sql = insert.append(values).toString();
            exchange.getOut().setBody(sql);
        }
    }) //
        .log("SQL: ${body}") //
        .to("jdbc:dataSource2?useHeadersAsParameters=true");
}

How to promptly get work item history with the TFS Java SDK (without workItem.open())?

For various reasons I must use the TFS Java SDK, and I always need workItem.open() to load the revisions for each work item returned by workItemClient.query(), like the code below:
WorkItemCollection workItemCollection = workItemClient.query(wiql);
for (int i = 0; i < workItemCollection.size(); i++) {
    WorkItem workItem = workItems.get(i);
    workItem.open();
    workItem.getRevisions()
            .forEach(r -> {
                //..."System.History").getValue());
                //..."System.History").getOriginalValue());
            });
}
If I don't use open(), the revisions are empty.
The open() call costs about 55 milliseconds per work item, which is unacceptable.
So, any suggestions for better efficiency? Thanks a lot.
/////////////////////////////////////////////////////////////////////make it/////////////////////////////////////////////////////////////////////
I made it work and cut the time by using SQL like below:
String sql = "select [System.Id],[System.ChangedBy],[System.ChangedDate],[words]"
        + " from Tfs_DefaultCollection.dbo.WorkItemsEverable"
        + " left join Tfs_DefaultCollection.dbo.WorkItemLongTexts on [System.Id] = [ID] and [System.Rev] = [Rev]"
        + " where FldID=54" // 54 represents the History field
        + " and" + conditionSql;
where conditionSql is, for example: " id in ('12','1324','1')"
With this method, the average cost of retrieving the history (ChangedBy, ChangedDate, History) is just 2-4 milliseconds (test cases of 30 to 1000 work items).
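For reference, a plain JDBC sketch of running that SQL (the server name and credentials are placeholders, and it assumes the Microsoft SQL Server JDBC driver is on the classpath; keep such access read-only, since querying the collection database directly is generally discouraged):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WorkItemHistoryQuery {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust the server, credentials and the id list for your environment.
        String url = "jdbc:sqlserver://your-tfs-sql-server;databaseName=Tfs_DefaultCollection";
        String conditionSql = " id in ('12','1324','1')";
        String sql = "select [System.Id],[System.ChangedBy],[System.ChangedDate],[words]"
                + " from Tfs_DefaultCollection.dbo.WorkItemsEverable"
                + " left join Tfs_DefaultCollection.dbo.WorkItemLongTexts on [System.Id] = [ID] and [System.Rev] = [Rev]"
                + " where FldID=54" // 54 represents the History field
                + " and" + conditionSql;

        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getInt("System.Id") + " | "
                        + rs.getString("System.ChangedBy") + " | "
                        + rs.getTimestamp("System.ChangedDate") + " | "
                        + rs.getString("words"));
            }
        }
    }
}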
/////////////////////////////////////////////////////////////////////still/////////////////////////////////////////////////////////////////////
Any advice for better code/efficiency would be appreciated.
The revision history can only be retrieved for the work item fields available in the WorkItem.Fields collection, so if you use the code below to get the work item revision history you will NOT see the history, but end up reloading the current work item object again and again:
foreach (Revision revision in wi.Revisions)
{
    Debug.Write(revision.WorkItem);
}
You may instead try to get the value of the fields in each work item revision (the sample below uses the .NET API syntax):
// Returns a work item object
var wi = GetWorkItemDetails(id);

// Get all work item revisions
foreach (Revision revision in wi.Revisions)
{
    // Get the value of each field in the work item revision
    foreach (Field field in wi.Fields)
    {
        Debug.Write(revision.Fields[field.Name].Value);
    }
}
Helpful blog: http://geekswithblogs.net/TarunArora/archive/2011/08/21/tfs-sdk-work-item-history-visualizer-using-tfs-api.aspx

Delay in retrieving the parent document using FT search

I am using an FT search on the $MessageID field to retrieve the parent document. My database is FT indexed. I need the parent document to accept the meeting invitation; however, I am only able to retrieve the document about two hours after getting the meeting invitation. I need help.
String messageiD = "<OFF0E85FF0.91FEF356-ON65257C97.00360343-65257C97.00361318@LocalDomain>";
if (messageiD.contains("@")) {
    String[] strArr = messageiD.split("@");
    messageiD = strArr[0].replace("<", "");
    System.out.println("message id is " + messageiD);
    //return messageiD;
}
String qry = "Field $MessageID CONTAINS " + messageiD;
DocumentCollection col1 = m_database.FTSearch(qry);
System.out.println("doc col length is " + col1.getCount());
Document docOld = col1.getFirstDocument();
System.out.println(docOld.getNoteID());
If you are only able to retrieve the result one or two hours later, then the FT index is not up to date when your request is processed. Use the updateFTIndex() method of the Database class to make sure it is up to date. Of course you can check whether it IS up to date using the getLastFTIndexed() method. Here is example code from the Designer help that uses these two methods:
try {
    Session session = getSession();
    AgentContext agentContext = session.getAgentContext();

    // (Your code goes here)
    Database db = agentContext.getCurrentDatabase();
    String title = db.getTitle();
    DateTime lastDT = db.getLastFTIndexed();
    DateTime nowDT = session.createDateTime("Today");
    nowDT.setNow();
    int daysSince = nowDT.timeDifference(lastDT) / 86400;
    if (daysSince > 2) {
        System.out.println("Database \"" + title +
            "\" was last full-text indexed " + daysSince + " days ago");
        System.out.println("Updating");
        db.updateFTIndex(true);
    } else {
        System.out.println("Database \"" + title +
            "\" was full-text indexed less than two days ago");
    }
} catch (Exception e) {
    e.printStackTrace();
}
Additional information: when creating a full-text index for a database, you define how often that index is updated.
But even when selecting "Immediate" in the dialog, this does not mean that the index will always be up to date. Updating the full-text index is a job of the server's Update task. If this task is too busy, the request is queued and might be delayed for some time, until resources are available to do the job.
The performance of full-text index updates can be improved by the server administrator by setting the notes.ini variable UPDATE_FULLTEXT_THREAD (see this link about the variable for details).

Pelops Java Client to insert into Cassandra Database

I recently started working with the Cassandra database and was able to set up a single-node cluster on my local machine.
Now I want to start writing some sample data to the Cassandra database using the Pelops client.
Below are the keyspace and column family I have created so far:
create keyspace my_keyspace with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use my_keyspace;
create column family users with column_type = 'Standard' and comparator = 'UTF8Type';
Below is the code that I have so far. I made some progress; earlier I was getting an exception which I was able to fix, but now I am getting another one.
public class MyPelops {

    private static final Logger log = Logger.getLogger(MyPelops.class);

    public static void main(String[] args) throws Exception {
        // A comma separated list of nodes
        String NODES = "localhost";
        // Thrift connection pool
        String THRIFT_CONNECTION_POOL = "Test Cluster";
        // Keyspace
        String KEYSPACE = "my_keyspace";
        // Column family
        String COLUMN_FAMILY = "users";

        Cluster cluster = new Cluster(NODES, 9160);
        Pelops.addPool(THRIFT_CONNECTION_POOL, cluster, KEYSPACE);

        Mutator mutator = Pelops.createMutator(THRIFT_CONNECTION_POOL);

        log.info("- Write Column -");
        mutator.writeColumn(
                COLUMN_FAMILY,
                "Row1",
                new Column().setName(" Name ".getBytes()).setValue(" Test One ".getBytes()));
        mutator.writeColumn(
                COLUMN_FAMILY,
                "Row1",
                new Column().setName(" Work ".getBytes()).setValue(" Engineer ".getBytes()));

        log.info("- Execute -");
        mutator.execute(ConsistencyLevel.ONE);

        Selector selector = Pelops.createSelector(THRIFT_CONNECTION_POOL);

        int columnCount = selector.getColumnCount(COLUMN_FAMILY, "Row1", ConsistencyLevel.ONE);
        log.info("- Column Count = " + columnCount);

        List<Column> columnList = selector.getColumnsFromRow(COLUMN_FAMILY, "Row1",
                Selector.newColumnsPredicateAll(true, 10), ConsistencyLevel.ONE);
        log.info("- Size of Column List = " + columnList.size());

        for (Column column : columnList) {
            log.info("- Column: (" + new String(column.getName()) + ","
                    + new String(column.getValue()) + ")");
        }

        log.info("- All Done. Exit -");
        System.exit(0);
    }
}
Whenever I run this program, I get this exception:
Exception in thread "main" org.scale7.cassandra.pelops.exceptions.InvalidRequestException: Column timestamp is required
The exception is thrown as soon as it reaches this line:
mutator.execute(ConsistencyLevel.ONE);
As I mentioned above, I am new to the Cassandra database and to the Pelops client; this is my first time working with them. Can anyone help me with this problem, step by step? I am running Cassandra 1.2.3 on my local box.
Any step-by-step guidance on how to insert data into the Cassandra database will help me a lot in understanding how Cassandra works.
Thanks in advance.
Each Cassandra column is a Key-Value-Timestamp triplet.
You didn't set the timestamp on your columns:
Column c = new Column();
c.setTimestamp(System.currentTimeMillis());
You can use the client's helper method to create the column and make the job easier:
mutator.writeColumn(
        COLUMN_FAMILY,
        "Row1",
        mutator.newColumn(" Name ", " Test One "));
This way you avoid both setting the timestamp (the client will do it for you) and calling getBytes() on the Strings.
Regards, Carlo
