mongo java driver slow compared to shell (15 times) - java

Trying to dump the _id field only.
With mongoexport it finishes in around 2 minutes:
time mongoexport -h localhost -d db1 -c collec1 -f _id -o u.text --csv
connected to: localhost
exported 68675826 records
real 2m20.970s
With Java it takes about 30 minutes:
java -cp mongo-test-assembly-0.1.jar com.poshmark.Test
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

class Test {
    public static void main(String[] args) {
        MongoClient mongoClient = new MongoClient("localhost");
        MongoDatabase database = mongoClient.getDatabase("db1");
        MongoCollection<Document> collection = database.getCollection("collec1");
        // Project only _id and print each document as it is pulled from the cursor
        MongoCursor<Document> iterator = collection.find().projection(new Document("_id", 1)).iterator();
        while (iterator.hasNext()) {
            System.out.println(iterator.next().toString());
        }
    }
}
CPU usage on the box is low, and I don't see any network latency issues since both tests run on the same box.
Update:
Used Files.newBufferedWriter instead of System.out.println but ended up with the same performance.
Looking at db.currentOp() makes me think that mongo is hitting disk, since the operation has so many numYields:
{
    "inprog" : [
        {
            "desc" : "conn8636699",
            "threadId" : "0x79a70c0",
            "connectionId" : 8636699,
            "opid" : 1625079940,
            "active" : true,
            "secs_running" : 12,
            "microsecs_running" : NumberLong(12008522),
            "op" : "getmore",
            "ns" : "users.users",
            "query" : {
                "_id" : {
                    "$exists" : true
                }
            },
            "client" : "10.1.166.219:60324",
            "numYields" : 10848,
            "locks" : {
            },
            "waitingForLock" : false,
            "lockStats" : {
                "Global" : {
                    "acquireCount" : {
                        "r" : NumberLong(21696)
                    },
                    "acquireWaitCount" : {
                        "r" : NumberLong(26)
                    },
                    "timeAcquiringMicros" : {
                        "r" : NumberLong(28783)
                    }
                },
                "MMAPV1Journal" : {
                    "acquireCount" : {
                        "r" : NumberLong(10848)
                    },
                    "acquireWaitCount" : {
                        "r" : NumberLong(5)
                    },
                    "timeAcquiringMicros" : {
                        "r" : NumberLong(40870)
                    }
                },
                "Database" : {
                    "acquireCount" : {
                        "r" : NumberLong(10848)
                    }
                },
                "Collection" : {
                    "acquireCount" : {
                        "R" : NumberLong(10848)
                    }
                }
            }
        }
    ]
}

The problem resides in STDOUT.
Printing to stdout is not inherently slow. It is the terminal you work with that is slow.
https://stackoverflow.com/a/3860319/3710490
The disk appears to be faster, because it is highly buffered.
The terminal, on the other hand, does little or no buffering: each individual print / write(line) waits for the full write (i.e. display to output device) to complete.
https://stackoverflow.com/a/3857543/3710490
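If the output really must go to stdout, buffering it on the Java side reduces the per-line cost. A minimal sketch, assuming the same `iterator` cursor as in the question's code (the 1 MB buffer size is an arbitrary choice, and IOException handling is omitted):
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

// Buffer stdout so each document does not trigger its own write to the terminal.
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(System.out), 1 << 20);
while (iterator.hasNext()) {
    out.write(iterator.next().get("_id").toString());
    out.newLine();
}
out.flush(); // flush once at the end instead of after every line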
I've reproduced your use case with a sufficiently similar dataset.
mongoexport to FILE
$ time "C:\Program Files\MongoDB\Server\4.2\bin\mongoexport.exe" -h localhost -d test -c collec1 -f _id -o u.text --csv
2020-03-28T13:03:01.550+0100 csv flag is deprecated; please use --type=csv instead
2020-03-28T13:03:02.433+0100 connected to: mongodb://localhost/
2020-03-28T13:03:03.479+0100 [........................] test.collec1 0/21028330 (0.0%)
2020-03-28T13:05:02.934+0100 [########################] test.collec1 21028330/21028330 (100.0%)
2020-03-28T13:05:02.934+0100 exported 21028330 records
real 2m1,936s
user 0m0,000s
sys 0m0,000s
mongoexport to STDOUT
$ time "C:\Program Files\MongoDB\Server\4.2\bin\mongoexport.exe" -h localhost -d test -c collec1 -f _id --csv
2020-03-28T14:43:16.479+0100 connected to: mongodb://localhost/
2020-03-28T14:43:16.545+0100 [........................] test.collec1 0/21028330 (0.0%)
2020-03-28T14:53:02.361+0100 [########################] test.collec1 21028330/21028330 (100.0%)
2020-03-28T14:53:02.361+0100 exported 21028330 records
real 9m45,962s
user 0m0,015s
sys 0m0,000s
JAVA to FILE
$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar FILE
Wasted time for [FILE] - 271,57 sec
real 4m32,174s
user 0m0,015s
sys 0m0,000s
JAVA to STDOUT to FILE
$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar SYSOUT > u.text
real 6m50,962s
user 0m0,015s
sys 0m0,000s
JAVA to STDOUT
$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar SYSOUT > u.text
Wasted time for [SYSOUT] - 709,33 sec
real 11m51,276s
user 0m0,000s
sys 0m0,015s
Java code
long init = System.currentTimeMillis();
try (MongoClient mongoClient = new MongoClient("localhost");
     BufferedWriter writer = Files.newBufferedWriter(Files.createTempFile("benchmarking", ".tmp"))) {
    MongoDatabase database = mongoClient.getDatabase("test");
    MongoCollection<Document> collection = database.getCollection("collec1");
    MongoCursor<Document> iterator = collection.find().projection(new Document("_id", 1)).iterator();
    while (iterator.hasNext()) {
        if ("SYSOUT".equals(args[0])) {
            System.out.println(iterator.next().get("_id"));
        } else {
            writer.write(iterator.next().get("_id") + "\n");
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
long end = System.currentTimeMillis();
System.out.println(String.format("Wasted time for [%s] - %.2f sec", args[0], (end - init) / 1_000.));

Related

scala MongoDB update with $cond and $not (not display the same result)

If you can help me: I have an update in Mongo with $cond. The update sets the field to one value if it is empty, and otherwise updates it with another value. Example in MongoDB:
I want to update the field camp1:
if camp1 does not exist = values
if camp1 exists = value2
db.getCollection('prueba').update(
    {"cdAccount": "ES3100810348150001326934"},
    [{$set:{camp1 :{"$cond": [{"$not": ["$camp1"]}, "values", "value2"]}}}]);
Result:
{
    "_id" : ObjectId("62dd08c3f9869303b79b323b"),
    "cdAccount" : "ES3100810348150001326934",
    "camp1" : "value2"
}
Now I do the same in Scala with this code:
def appendIfNotNull(key: String, value: Object) = {
    var eq2Array = new util.ArrayList[Object]()
    eq2Array.add("$" + key)
    val eq2Op = new Document("$not", eq2Array)
    var condList = new util.ArrayList[Object]()
    condList.add(eq2Op)
    condList.add(value.asInstanceOf[AnyRef])
    //condList.add("$"+key)
    condList.add("value2")
    val availDoc =
        new Document("$cond",
            new Document("$cond", condList)).toBsonDocument(classOf[BsonDocument], getCodecRegistry).get("$cond")
    println("availDoc : " + availDoc)
    documentGrab.append(key, availDoc)
}
val finalVar = appendIfNotNull("camp1","values")
println("finalVar : " + finalVar)
availDoc : {"$cond": [{"$not": ["$camp1"]}, "values", "value2"]}
finalVar : Document{{camp1={"$cond": [{"$not": ["$camp1"]}, "values", "value2"]}}}
val updateDocument = new Document("$set" , finalVar )
println("updateDocument : " + updateDocument)
collectionA.updateMany(Filters.eq("cdAccount", "ES3100810348150001326934"),updateDocument)
The only difference I see is that in MongoDB the "[" is added at the beginning of the $set, and then it works correctly.
MongoDB
[ {$set:{camp1 :{"$cond": [{"$not": ["$camp1"]}, "values", "value2"]}}} ] --> OK, update works
Scala
{$set:{camp1 :{"$cond": [{"$not": ["$camp1"]}, "values", "value2"]}}} --> OK in Scala, but I get Result II
I am using MongoDB 5.0.9.
Now in MongoDB I execute the statement generated by Scala:
db.getCollection('prueba').update(
    {"cdAccount": "ES3100810348150001326934"},
    {$set :{camp1 :{"$cond": [{"$not": ["$camp1"]}, "values", "value2"]}}});
When I run it in Scala the same thing happens.
Result II:
{
    "cdAccount" : "ES3100810348150001326934",
    "camp1" : {
        "$cond" : [
            {
                "$not" : [
                    "$camp1"
                ]
            },
            "values",
            "value2"
        ]
    }
}
Can someone tell me how to fix it?
Thank you so much
You can see the very important difference when printing the queries.
$cond is an aggregation pipeline operator. It is processed only when an aggregation pipeline is used to update the data. When a simple (non-pipelined) update is used, the operator has no special meaning, and this is exactly what you see in the output.
You indicate a "pipeline update" by passing an array instead of a simple object as the update description in the JavaScript API (and the mongo console). In Scala/Java you have to use one of the updateMany overloads that takes the update description as a List, not a Bson. I.e. you need something like:
collectionA.updateMany(
    Filters.eq("cdAccount", "ES3100810348150001326934"),
    Collections.singletonList(updateDocument)
)
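For completeness, a minimal self-contained sketch of such a pipeline update with the sync Java driver; the connection string and database name are placeholders, and the $cond document is built programmatically instead of with the helper above:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import java.util.Arrays;
import java.util.Collections;

public class PipelineUpdateExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> collection =
                    client.getDatabase("test").getCollection("prueba");

            // {"$cond": [{"$not": ["$camp1"]}, "values", "value2"]}
            Document cond = new Document("$cond", Arrays.asList(
                    new Document("$not", Arrays.asList("$camp1")), "values", "value2"));

            // Passing a List makes the server treat the update as an aggregation
            // pipeline, so $cond is evaluated instead of being stored literally.
            collection.updateMany(
                    Filters.eq("cdAccount", "ES3100810348150001326934"),
                    Collections.singletonList(new Document("$set", new Document("camp1", cond))));
        }
    }
}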

No output of Perl script called from Java

I am using the Ganymed SSH lib in Java to connect to a Linux machine, execute some Unix scripts and display their output.
I am running a parent shell script which in turn runs a few other sub-scripts and finally a Perl script. All works well for the shell scripts, but when it reaches the Perl script I stop getting any output.
If I run the parent script manually on the Linux server I see the output from Perl without issues.
Here's the relevant Java code, which connects to the machine, calls the shell script, and returns a BufferedReader from which the output can be read line by line:
try {
    conn = new Connection(server);
    conn.connect();
    boolean isAuthenticated = conn.authenticateWithPublicKey(user, keyfile, keyfilePass);
    if (isAuthenticated == false) {
        throw new IOException("Authentication failed.");
    }
    sess = conn.openSession();
    if (param == null) {
        sess.execCommand(". ./.bash_profile; cd $APP_HOME; ./parent_script.sh");
    }
    else {...}
    InputStream stdout = new StreamGobbler(sess.getStdout());
    reader = new BufferedReader(new InputStreamReader(stdout));
} catch (IOException e) {
    e.printStackTrace();
}
My parent shell script looks like this:
./start1 #script1 output OK
./start2 #script2 output OK
./start3 #script3 output OK
/u01/app/perl_script.pl # NO OUTPUT HERE :(
Would anyone have any idea why this happens?
EDIT: Adding the Perl script
#!/u01/app/repo/code/sh/perl.sh
use FindBin qw/ $Bin /;
use File::Basename qw/ dirname /;
use lib (dirname($Bin). "/pm" );
use Capture::Tiny qw/:all/;
use Data::Dumper;
use Archive::Zip;
use XML::Simple;
use MXA;
my $mx = new MXA;
chdir $mx->config->{$APP_HOME};
warn Dumper { targets => $mx->config->{RTBS} };
foreach my $target (keys %{ $mx->config->{RTBS}->{Targets} }) {
    my $cfg = $mx->config->{RTBS}->{Targets}->{$target};
    my @commands = (
        [
            ...
        ],
        [
            'unzip',
            '-o',
            "$cfg->{ConfigName}.zip",
            'Internal/AdapterConfig/Driver.xml'
        ],
        [
            'zip',
            "$cfg->{ConfigName}.zip",
            'Internal/AdapterConfig/Driver.xml'
        ],
        [
            'rm -rf Internal'
        ],
        [
            "rm -f $cfg->{ConfigName}.zip"
        ],
    );
    foreach my $cmnd (@commands) {
        warn Dumper { command => $cmnd };
        my ($stdout, $stderr, $status) = capture {
            system(@{ $cmnd });
        };
        warn Dumper { stdout => $stdout,
                      stderr => $stderr,
                      status => $status };
    }
=pod
    warn "runnnig -scriptant /ANT_FILE:mxrt.RT_${target}argets_std.xml /ANT_TARGET:exportConfig -jopt:-DconfigName=Fixing -jopt:-DfileName=Fixing.zip');
    ($stdout, $stderr, $status) = capture {
        system('./command.app', '-scriptant -parameters');
    }
    ($stdout, $stderr, $status) = capture {
        system('unzip Real-time.zip Internal/AdapterConfig/Driver.xml');
    };
    my $xml = XMLin('Internal/AdapterConfig/MDPDriver.xml');
    print Dumper { xml => $xml };
    [[ ${AREAS} == "pr" ]] && {
        ${PREFIX}/substitute_mdp_driver_parts Internal/AdapterConfig/Driver.xml 123 controller#mdp-a-n1,controller#mdp-a-n2
    } || {
        ${PREFIX}/substitute_mdp_driver_parts Internal/AdapterConfig/Driver.xml z8pnOYpulGnWrR47y5UH0e96IU0OLadFdW/Bm controller#md-uat-n1,controller#md-uat-n2
    }
    zip Real-time.zip Internal/AdapterConfig/Driver.xml
    rm -rf Internal
    rm -f Real-time.zip
    print $mx->Dump( stdout => $stdout,
                     stderr => $stderr,
                     status => $status );
=cut
}
The part of your Perl code that produces the output is:
warn Dumper { stdout => $stdout,
stderr => $stderr,
status => $status };
Looking at the documentation for warn() we see:
Emits a warning, usually by printing it to STDERR
But your Java program is reading from STDOUT.
InputStream stdout = new StreamGobbler(sess.getStdout());
You have a couple of options.
Change your Perl code to send the output to STDOUT instead of STDERR. This could be as simple as changing warn() to print().
When calling the Perl program in the shell script, redirect STDERR to STDOUT.
/u01/app/perl_script.pl 2>&1
I guess you could also set up your Java program to read from STDERR as well. But I'm not a Java programmer, so I wouldn't be able to advise you on the best way to do that.
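For that last option, here's a rough, untested sketch of what reading STDERR could look like, assuming the same Ganymed Session object (sess) and imports as in the question:
// Rough sketch (untested): also drain stderr, which is where the Perl warn() output ends up.
InputStream stderr = new StreamGobbler(sess.getStderr());
BufferedReader errReader = new BufferedReader(new InputStreamReader(stderr));

String line;
while ((line = errReader.readLine()) != null) {
    System.out.println(line); // forward the Perl script's stderr output
}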

java.lang.OutOfMemoryError: Java heap space when transferring data from jdbc to elasticsearch via logstash [duplicate]

This question already has answers here:
How to deal with "java.lang.OutOfMemoryError: Java heap space" error?
(31 answers)
Closed 1 year ago.
I have a huge Postgres database with 20 million rows and I want to transfer it to Elasticsearch via Logstash. I followed the advice mentioned here and tested it on a simple database with 300 rows and everything worked fine, but when I tested it on my main database I always end up with this error:
nargess#nargess-Surface-Book:/usr/share/logstash/bin$ sudo ./logstash -w 1 -f students.conf --path.data /usr/share/logstash/data/students/ --path.settings /etc/logstash
Sending Logstash's logs to /var/log/logstash which is now configured via log4j2.properties
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid3453.hprof ...
Heap dump file created [13385912484 bytes in 53.304 secs]
Exception in thread "Ruby-0-Thread-11: /usr/share/logstash/vendor/bundle/jruby/1.9/gems/puma-2.16.0-java/lib/puma/thread_pool.rb:216" java.lang.ArrayIndexOutOfBoundsException: -1
at org.jruby.runtime.ThreadContext.popRubyClass(ThreadContext.java:729)
at org.jruby.runtime.ThreadContext.postYield(ThreadContext.java:1292)
at org.jruby.runtime.ContextAwareBlockBody.post(ContextAwareBlockBody.java:29)
at org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:198)
at org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)
at org.jruby.runtime.Block.call(Block.java:101)
at org.jruby.RubyProc.call(RubyProc.java:300)
at org.jruby.RubyProc.call(RubyProc.java:230)
at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:103)
at java.lang.Thread.run(Thread.java:748)
The signal INT is in use by the JVM and will not work correctly on this platform
Error: Your application used more memory than the safety cap of 12G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
Although I went to the file /etc/logstash/jvm.options and set -Xms256m -Xmx12000m, I still get these errors. I have 13 GB of memory free. How can I send my data to Elasticsearch with this memory?
This is the student-index.json that I use in Elasticsearch:
{
    "aliases": {},
    "warmers": {},
    "mappings": {
        "tab_students_dfe": {
            "properties": {
                "stcode": {
                    "type": "text"
                },
                "voroodi": {
                    "type": "integer"
                },
                "name": {
                    "type": "text"
                },
                "family": {
                    "type": "text"
                },
                "namp": {
                    "type": "text"
                },
                "lastupdate": {
                    "type": "date"
                },
                "picture": {
                    "type": "text"
                },
                "uniquename": {
                    "type": "text"
                }
            }
        }
    },
    "settings": {
        "index": {
            "number_of_shards": "5",
            "number_of_replicas": "1"
        }
    }
}
Then I try to insert this index into Elasticsearch with:
curl -XPUT --header "Content-Type: application/json" http://localhost:9200/students -d @postgres-index.json
And next, this is my configuration file, /usr/share/logstash/bin/students.conf:
input {
    jdbc {
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
        jdbc_user => "postgres"
        jdbc_password => "postgres"
        # The path to downloaded jdbc driver
        jdbc_driver_library => "./postgresql-42.2.1.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        # The path to the file containing the query
        statement => "select * from students"
    }
}
filter {
    aggregate {
        task_id => "%{stcode}"
        code => "
            map['stcode'] = event.get('stcode')
            map['voroodi'] = event.get('voroodi')
            map['name'] = event.get('name')
            map['family'] = event.get('family')
            map['namp'] = event.get('namp')
            map['uniquename'] = event.get('uniquename')
            event.cancel()
        "
        push_previous_map_as_event => true
        timeout => 5
    }
}
output {
    elasticsearch {
        document_id => "%{stcode}"
        document_type => "postgres"
        index => "students"
        codec => "json"
        hosts => ["127.0.0.1:9200"]
    }
}
Thank you for your help
This is a bit old, but I just had the same issue and increasing the heap size of logstash helped me here. I added this to my logstash service in the docker-compose file:
environment:
  LS_JAVA_OPTS: "-Xmx2048m -Xms2048m"
Further read: What are the -Xms and -Xmx parameters when starting JVM?

MongoDB - strange behaviour of query

I have these two documents in my MongoDB database:
db.DocumentFile.find().pretty()
{
    "_id" : ObjectId("587f39910cc0fec092bdb10c"),
    "_class" : "com.smartinnotec.legalprojectmanagement.dao.domain.DocumentFile",
    "fileName" : "DocumentFile1",
    "ending" : "jpg",
    "projectId" : "587f39910cc0fec092bdb10b",
    "active" : true,
    "userIdBlackList" : [
        "587f39910cc0fec092bdb10a"
    ]
}
{
    "_id" : ObjectId("587f39910cc0fec092bdb10d"),
    "_class" : "com.smartinnotec.legalprojectmanagement.dao.domain.DocumentFile",
    "fileName" : "DocumentFile2",
    "ending" : "jpg",
    "projectId" : "587f39910cc0fec092bdb10b",
    "active" : true,
    "userIdBlackList" : [ ]
}
I have this code in order to get the count for the query:
final Query query = new Query();
query.addCriteria(Criteria.where("‌​userIdBlackList").nin(userId));
final Long amount = mongoTemplate.count(query, DocumentFile.class);
return amount.intValue();
The amount is 2 in this case, which is wrong - it should be 1.
The query in Query object looks like this:
Query: { "‌​userIdBlackList" : { "$nin" : [ "587f39910cc0fec092bdb10a"]}}
If I copy this query and run it in the mongodb console like this:
db.DocumentFile.find({ "‌​userIdBlackList" : { "$nin" : [ "587f39910cc0fec092bdb10a"]}}).pretty()
I get a count of two, which is wrong because one document includes 587f39910cc0fec092bdb10a in userIdBlackList -> it should be one.
With this query command:
db.DocumentFile.find({userIdBlackList: { "$nin": ["587f39910cc0fec092bdb10a"] } }).pretty();
I get the right result. I am really confused at the moment.
Does anyone have any idea?
Maybe the problem is that one time userIdBlackList is with quotation marks ("userIdBlackList") and the other time it isn't.
I think the problem is the unintentional formatting picked up for "userIdBlackList". Your string is interpreted with non-printing characters in the "??userIdBlackList" for all your search queries. I see little transparent square boxes when I copy your queries into the mongo shell.
That tells me there is some encoding issue. Clear that formatting and see if that helps you.
Both $ne and $nin should work!
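As a sanity check, re-type the field name by hand instead of copy-pasting it, so no hidden characters are carried over. A minimal sketch with the same Spring Data MongoTemplate calls as in the question (assuming the mongoTemplate, userId and DocumentFile from there):
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

// Field name typed by hand, so no invisible characters end up in the query.
final Query query = new Query();
query.addCriteria(Criteria.where("userIdBlackList").nin(userId));
final long amount = mongoTemplate.count(query, DocumentFile.class);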

Index_not_found_exception no such index found in elasticsearch using Powershell

I have created two files, which are:
jdbc_sqlserver.json:
{
    "type": "jdbc",
    "jdbc": {
        "url": "jdbc:sqlserver://localhost:1433;databaseName=merchant2merchant;integratedSecurity=true;",
        "user": "",
        "password": "",
        "sql": "select * from planets",
        "treat_binary_as_string": true,
        "elasticsearch": {
            "cluster": "elasticsearch",
            "host": "localhost",
            "port": 9200
        },
        "index": "testing"
    }
}
jdb_sqlserver.ps1:
function Get-PSVersion {
    if (test-path variable:psversiontable) {
        $psversiontable.psversion
    } else {
        [version]"1.0.0.0"
    }
}
$powershell = Get-PSVersion
if ($powershell.Major -le 2) {
    Write-Error "Oh, so sorry, this script requires Powershell 3 (due to convertto-json)"
    exit
}
if ((Test-Path env:\JAVA_HOME) -eq $false) {
    Write-Error "Environment variable JAVA_HOME must be set to your java home"
    exit
}
curl -XDELETE "http://localhost:9200/users/"
$DIR = "C:\Program Files\elasticsearch\plugins\elasticsearch-jdbc-2.3.4.0-dist\elasticsearch-jdbc-2.3.4.0\"
$FEEDER_CLASSPATH = "$DIR\lib"
$FEEDER_LOGGER = "file://$DIR\bin\log4j2.xml"
java -cp "$FEEDER_CLASSPATH\*" "-Dlog4j.configurationFile=$FEEDER_LOGGER" "org.xbib.tools.Runner" "org.xbib.tools.JDBCImporter" jdbc_sqlserver.json
and running the second one in PowerShell using the command .\jdb_sqlserver.ps1 in the "C:\servers\elasticsearch\bin\feeder" path, but I got an error like Index_not_found_exception: no such index found in PowerShell.
