I have created two files:
jdbc_sqlserver.json:
{
"type": "jdbc",
"jdbc": {
"url": "jdbc:sqlserver://localhost:1433;databaseName=merchant2merchant;integratedSecurity=true;",
"user": "",
"password": "",
"sql": "select * from planets",
"treat_binary_as_string": true,
"elasticsearch": {
"cluster": "elasticsearch",
"host": "localhost",
"port": 9200
},
"index": "testing"
}
}
jdb_sqlserver.ps1:
function Get-PSVersion {
    if (test-path variable:psversiontable) {
        $psversiontable.psversion
    } else {
        [version]"1.0.0.0"
    }
}
$powershell = Get-PSVersion
if ($powershell.Major -le 2) {
    Write-Error "Oh, so sorry, this script requires Powershell 3 (due to convertto-json)"
    exit
}
if ((Test-Path env:\JAVA_HOME) -eq $false) {
    Write-Error "Environment variable JAVA_HOME must be set to your java home"
    exit
}
curl -XDELETE "http://localhost:9200/users/"
$DIR = "C:\Program Files\elasticsearch\plugins\elasticsearch-jdbc-2.3.4.0-dist\elasticsearch-jdbc-2.3.4.0\"
$FEEDER_CLASSPATH = "$DIR\lib"
$FEEDER_LOGGER = "file://$DIR\bin\log4j2.xml"
java -cp "$FEEDER_CLASSPATH\*" "-Dlog4j.configurationFile=$FEEDER_LOGGER" "org.xbib.tools.Runner" "org.xbib.tools.JDBCImporter" jdbc_sqlserver.json
I run the second one in PowerShell with the command .\jdb_sqlserver.ps1 from the "C:\servers\elasticsearch\bin\feeder" path, but I get an error like Index_not_found_exception, no such index found, in PowerShell.
I am running a Gradle task:
gradlew -b import.gradle copy_taskName -PinputHost="Host1" -PoutputHost="Host2" -Pduration=1 --stacktrace
In import.gradle there is an mlcp task, to which we pass taskName.json (where all the queries used to fetch data from the input host are written in JSON format) in the query_filter field.
While running the task, I am getting:
Caused by: java.io.IOException: Cannot run program "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" (in directory "D:\Data1"): CreateProcess error=206, The filename or extension is too long
at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
... 5 more
Caused by: java.io.IOException: CreateProcess error=206, The filename or extension is too long
... 6 more
When I remove some queries from taskName.json, I do not get the issue.
I want to know: are there any constraints on the size or number of queries written in taskName.json that we pass in the query_filter parameter when running the mlcp task?
The total number of lines of query content in taskName.json is 398.
Sample taskName.json content:
{
"andQuery": {
"queries": [{
"collectionQuery": {
"uris": ["collection1"]
}
},
{
"orQuery": {
"queries": [
{
"elementValueQuery": {
"element": ["{http://namespace.com/a/b}id"],
"text": ["text1"],
"options": ["lang=en"]
}
},
{
"elementValueQuery": {
"element": ["{http://namespace.com/a/b}id"],
"text": ["text2"],
"options": ["lang=en"]
}
}]
}
},
{
"notQuery": {
"query": {
"elementRangeQuery": {
"element": ["{http://namespace.com/a/b}date"],
"operator": ">",
"value": [{
"type": "dateTime",
"val": "%%now%%"
}]
}
}
}
}]
}
}
import.gradle
def importDirs = new File("./teams").listFiles()
importDirs.each { importDir ->
def queries = importDir.listFiles()
queries.each { file ->
def taskname = importDir.name + "_" +file.name.replace('.json', '')
task "copy_$taskname" (
type: com.marklogic.gradle.task.MlcpTask,
group: 'abc',
dependsOn: []) {
classpath = configurations.mlcp
command = 'COPY'
input_database = mlAppConfig.contentDatabaseName
input_host = inputHost
input_port = port
input_username = inputUsername
input_password = inputPassword
output_database = mlAppConfig.contentDatabaseName
output_host = outputHost
output_port = port
output_username = outputUsername
output_password = outputPassword
query_filter = file.text.replaceAll('"','\\\\"').replaceAll('%%now%%', now).replaceAll('%%targetDate%%', targetDate)
max_split_size = 500
}
}
}
MlcpTask extends Gradle's JavaExec task, and the error is coming from the net.rubygrapefruit.platform.internal.DefaultProcessLauncher class. So my hunch is that since the contents of taskName.json are being tossed into the command line arguments, some limit within the JavaExec class is being reached. You could try extending JavaExec directly to confirm this; I am thinking you'd run into the same error.
One possible solution would be to use the MLCP options_file option. Since you're parameterizing the contents of taskName.json, you'd likely need to generate the contents of that options file dynamically, presumably in a doFirst block on each Gradle task. But that avoids very long command line arguments, as you'll instead put them into options_file, and you may only need to put query_filter in there.
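A hedged sketch of that approach (not verified): assuming MlcpTask exposes an options_file property that it forwards to MLCP like its other options, each generated task could write the long query into a file at execution time instead of passing it as a command line argument. The property name, file locations, and escaping below are illustrative.
task "copy_$taskname" (type: com.marklogic.gradle.task.MlcpTask, group: 'abc') {
    classpath = configurations.mlcp
    command = 'COPY'
    // ...same connection properties as in import.gradle...

    // Put the long -query_filter argument into an MLCP options file instead of the command line.
    def optionsFile = new File(buildDir, "mlcp-options/${taskname}.txt")
    options_file = optionsFile.absolutePath   // assumption: MlcpTask forwards this option to MLCP

    doFirst {
        optionsFile.parentFile.mkdirs()
        // An MLCP options file lists one argument per line: the option name, then its value.
        // The quote escaping used for the command line is likely unnecessary here.
        optionsFile.text = "-query_filter\n" +
                file.text.replaceAll('%%now%%', now).replaceAll('%%targetDate%%', targetDate)
    }
}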
I am trying to dump only the _id column.
With mongoexport it finishes in around 2 minutes:
time mongoexport -h localhost -d db1 -c collec1 -f _id -o u.text --csv
connected to: localhost
exported 68675826 records
real 2m20.970s
With java it takes about 30 minutes:
java -cp mongo-test-assembly-0.1.jar com.poshmark.Test
package com.poshmark;

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

class Test {
    public static void main(String[] args) {
        MongoClient mongoClient = new MongoClient("localhost");
        MongoDatabase database = mongoClient.getDatabase("db1");
        MongoCollection<Document> collection = database.getCollection("collec1");
        // Project only the _id field and print every document to stdout
        MongoCursor<Document> iterator = collection.find().projection(new Document("_id", 1)).iterator();
        while (iterator.hasNext()) {
            System.out.println(iterator.next().toString());
        }
    }
}
CPU usage on the box is low, and I don't see any network latency issues, since both tests run on the same box.
Update:
I used Files.newBufferedWriter instead of System.out.println but ended up with the same performance.
Looking at db.currentOp() makes me think that mongo is hitting disk, since the operation has too many numYields:
{
"inprog" : [
{
"desc" : "conn8636699",
"threadId" : "0x79a70c0",
"connectionId" : 8636699,
"opid" : 1625079940,
"active" : true,
"secs_running" : 12,
"microsecs_running" : NumberLong(12008522),
"op" : "getmore",
"ns" : "users.users",
"query" : {
"_id" : {
"$exists" : true
}
},
"client" : "10.1.166.219:60324",
"numYields" : 10848,
"locks" : {
},
"waitingForLock" : false,
"lockStats" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(21696)
},
"acquireWaitCount" : {
"r" : NumberLong(26)
},
"timeAcquiringMicros" : {
"r" : NumberLong(28783)
}
},
"MMAPV1Journal" : {
"acquireCount" : {
"r" : NumberLong(10848)
},
"acquireWaitCount" : {
"r" : NumberLong(5)
},
"timeAcquiringMicros" : {
"r" : NumberLong(40870)
}
},
"Database" : {
"acquireCount" : {
"r" : NumberLong(10848)
}
},
"Collection" : {
"acquireCount" : {
"R" : NumberLong(10848)
}
}
}
}
]
}
The problem resides in STDOUT.
Printing to stdout is not inherently slow. It is the terminal you work with that is slow.
https://stackoverflow.com/a/3860319/3710490
The disk appears to be faster, because it is highly buffered.
The terminal, on the other hand, does little or no buffering: each individual print / write(line) waits for the full write (i.e. display to output device) to complete.
https://stackoverflow.com/a/3857543/3710490
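To illustrate the buffering point, here is a minimal sketch (not from the original answer): wrapping System.out in a large buffered stream with autoflush disabled means writes reach the terminal in big chunks instead of one line at a time; the 64 KiB buffer size is an arbitrary choice.
import java.io.BufferedOutputStream;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;

public class BufferedStdout {
    public static void main(String[] args) {
        // Buffered, non-autoflushing wrapper around the stdout file descriptor
        PrintStream out = new PrintStream(
                new BufferedOutputStream(new FileOutputStream(FileDescriptor.out), 64 * 1024),
                false);
        for (int i = 0; i < 1_000_000; i++) {
            out.println(i);
        }
        out.flush(); // push out whatever is still sitting in the buffer
    }
}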
I've reproduced your use case with a similar enough dataset.
mongoexport to FILE
$ time "C:\Program Files\MongoDB\Server\4.2\bin\mongoexport.exe" -h localhost -d test -c collec1 -f _id -o u.text --csv
2020-03-28T13:03:01.550+0100 csv flag is deprecated; please use --type=csv instead
2020-03-28T13:03:02.433+0100 connected to: mongodb://localhost/
2020-03-28T13:03:03.479+0100 [........................] test.collec1 0/21028330 (0.0%)
2020-03-28T13:05:02.934+0100 [########################] test.collec1 21028330/21028330 (100.0%)
2020-03-28T13:05:02.934+0100 exported 21028330 records
real 2m1,936s
user 0m0,000s
sys 0m0,000s
mongoexport to STDOUT
$ time "C:\Program Files\MongoDB\Server\4.2\bin\mongoexport.exe" -h localhost -d test -c collec1 -f _id --csv
2020-03-28T14:43:16.479+0100 connected to: mongodb://localhost/
2020-03-28T14:43:16.545+0100 [........................] test.collec1 0/21028330 (0.0%)
2020-03-28T14:53:02.361+0100 [########################] test.collec1 21028330/21028330 (100.0%)
2020-03-28T14:53:02.361+0100 exported 21028330 records
real 9m45,962s
user 0m0,015s
sys 0m0,000s
JAVA to FILE
$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar FILE
Wasted time for [FILE] - 271,57 sec
real 4m32,174s
user 0m0,015s
sys 0m0,000s
JAVA to STDOUT to FILE
$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar SYSOUT > u.text
real 6m50,962s
user 0m0,015s
sys 0m0,000s
JAVA to STDOUT
$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar SYSOUT > u.text
Wasted time for [SYSOUT] - 709,33 sec
real 11m51,276s
user 0m0,000s
sys 0m0,015s
Java code
// Wrapped in a class with a main method for completeness; the class name is arbitrary.
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.io.BufferedWriter;
import java.nio.file.Files;

public class Benchmark {
    public static void main(String[] args) {
        long init = System.currentTimeMillis();
        try (MongoClient mongoClient = new MongoClient("localhost");
             BufferedWriter writer = Files.newBufferedWriter(Files.createTempFile("benchmarking", ".tmp"))) {
            MongoDatabase database = mongoClient.getDatabase("test");
            MongoCollection<Document> collection = database.getCollection("collec1");
            MongoCursor<Document> iterator = collection.find().projection(new Document("_id", 1)).iterator();
            while (iterator.hasNext()) {
                if ("SYSOUT".equals(args[0])) {
                    System.out.println(iterator.next().get("_id"));
                } else {
                    writer.write(iterator.next().get("_id") + "\n");
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println(String.format("Wasted time for [%s] - %.2f sec", args[0], (end - init) / 1_000.));
    }
}
I want to migrate from Zuul to Spring Cloud Gateway, and I don't want to change the config of my existing apps. I want to know how to handle URLs of the form "/api/" + serviceId and route them to lb://serviceId.
The previous Zuul config:
zuul:
  prefix: /api
There are lots of services registered in Eureka, and I don't want to configure a route for each one.
E.g. this is the route auto-generated by org.springframework.cloud.gateway.discovery.DiscoveryClientRouteDefinitionLocator:
{
"route_id": "CompositeDiscoveryClient_APIGATEWAY",
"route_definition": {
"id": "CompositeDiscoveryClient_APIGATEWAY",
"predicates": [
{
"name": "Path",
"args": {
"pattern": "/apigateway/**"
}
}
],
"filters": [
{
"name": "RewritePath",
"args": {
"regexp": "/apigateway/(?<remaining>.*)",
"replacement": "/${remaining}"
}
}
],
"uri": "lb://APIGATEWAY",
"order": 0
}
}
What I want is:
{
"route_id": "CompositeDiscoveryClient_APIGATEWAY",
"route_definition": {
"id": "CompositeDiscoveryClient_APIGATEWAY",
"predicates": [
{
"name": "Path",
"args": {
"pattern": "/api/apigateway/**"
}
}
],
"filters": [
{
"name": "RewritePath",
"args": {
"regexp": "/api/apigateway/(?<remaining>.*)",
"replacement": "/${remaining}"
}
}
],
"uri": "lb://APIGATEWAY",
"order": 0
}
}
How can I configure my routes to get this?
I also found this source code:
public static List<PredicateDefinition> initPredicates() {
ArrayList<PredicateDefinition> definitions = new ArrayList<>();
// TODO: add a predicate that matches the url at /serviceId?
// add a predicate that matches the url at /serviceId/**
PredicateDefinition predicate = new PredicateDefinition();
predicate.setName(normalizeRoutePredicateName(PathRoutePredicateFactory.class));
predicate.addArg(PATTERN_KEY, "'/'+serviceId+'/**'");
definitions.add(predicate);
return definitions;
}
the "'/'+serviceId+'/**'" is there without a prefix
[2019-01-10] UPDATE
I think @spencergibb's suggestion is a good solution, but I ran into new trouble with the SpEL expression.
I tried many ways:
args:
  regexp: "'/api/' + serviceId.toLowerCase() + '/(?<remaining>.*)'"
  replacement: '/${remaining}'
args:
  regexp: "'/api/' + serviceId.toLowerCase() + '/(?<remaining>.*)'"
  replacement: "'/${remaining}'"
It failed to start:
Origin: class path resource [application.properties]:23:70
Reason: Could not resolve placeholder 'remaining' in value "'${remaining}'"
When I use an escape "\" like:
args:
  regexp: "'/api/' + serviceId.toLowerCase() + '/(?<remaining>.*)'"
  replacement: '/$\{remaining}'
it starts successfully, but I get an exception at runtime:
org.springframework.expression.spel.SpelParseException: Expression [/$\{remaining}] #2: EL1065E: unexpected escape character.
at org.springframework.expression.spel.standard.Tokenizer.raiseParseException(Tokenizer.java:590) ~[spring-expression-5.0.5.RELEASE.jar:5.0.5.RELEASE]
at org.springframework.expression.spel.standard.Tokenizer.process(Tokenizer.java:265) ~[spring-expression-5.0.5.RELEASE.jar:5.0.5.RELEASE]
UPDATE 2
I found that in org.springframework.cloud.gateway.filter.factory.RewritePathGatewayFilterFactory there is a replace() call that deals with the "$\" escape:
...
@Override
public GatewayFilter apply(Config config) {
    String replacement = config.replacement.replace("$\\", "$");
    return (exchange, chain) -> {
...
but in the code path that throws the SpelParseException there is no such replacement.
You can customize the automatic filters and predicates used via properties.
spring:
  cloud:
    gateway:
      discovery:
        locator:
          enabled: true
          predicates:
            - name: Path
              args:
                pattern: "'/api/'+serviceId.toLowerCase()+'/**'"
          filters:
            - name: RewritePath
              args:
                regexp: "'/api/' + serviceId.toLowerCase() + '/(?<remaining>.*)'"
                replacement: "'/${remaining}'"
Note that the values (i.e. args.pattern or args.regexp) are all Spring Expression Language (SpEL) expressions, hence the single quotes, the + concatenation, etc.
If different routes need to have different prefixes, you'd need to define each route in properties.
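For reference, here is a hedged sketch of what one such explicit route could look like in Java configuration instead of properties; "myservice" is a placeholder service id, not something from the question.
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ApiPrefixRouteConfig {

    // One explicit route that strips the /api prefix before forwarding to the
    // load-balanced service; one such route would be needed per service.
    @Bean
    public RouteLocator apiPrefixRoutes(RouteLocatorBuilder builder) {
        return builder.routes()
                .route("myservice_api", r -> r
                        .path("/api/myservice/**")
                        .filters(f -> f.rewritePath("/api/myservice/(?<remaining>.*)", "/${remaining}"))
                        .uri("lb://myservice"))
                .build();
    }
}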
I have a huge Postgres database with 20 million rows, and I want to transfer it to Elasticsearch via Logstash. I followed the advice mentioned here and tested it on a simple database with 300 rows, and everything worked fine, but when I tested it on my main database I always ran into this error:
nargess#nargess-Surface-Book:/usr/share/logstash/bin$ sudo ./logstash -w 1 -f students.conf --path.data /usr/share/logstash/data/students/ --path.settings /etc/logstash
Sending Logstash's logs to /var/log/logstash which is now configured via log4j2.properties
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid3453.hprof ...
Heap dump file created [13385912484 bytes in 53.304 secs]
Exception in thread "Ruby-0-Thread-11: /usr/share/logstash/vendor/bundle/jruby/1.9/gems/puma-2.16.0-java/lib/puma/thread_pool.rb:216" java.lang.ArrayIndexOutOfBoundsException: -1
at org.jruby.runtime.ThreadContext.popRubyClass(ThreadContext.java:729)
at org.jruby.runtime.ThreadContext.postYield(ThreadContext.java:1292)
at org.jruby.runtime.ContextAwareBlockBody.post(ContextAwareBlockBody.java:29)
at org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:198)
at org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)
at org.jruby.runtime.Block.call(Block.java:101)
at org.jruby.RubyProc.call(RubyProc.java:300)
at org.jruby.RubyProc.call(RubyProc.java:230)
at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:103)
at java.lang.Thread.run(Thread.java:748)
The signal INT is in use by the JVM and will not work correctly on this platform
Error: Your application used more memory than the safety cap of 12G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
Although I went to /etc/logstash/jvm.options and set -Xms256m and -Xmx12000m, I still get these errors. I have 13 GB of memory free. How can I send my data to Elasticsearch with this memory?
This is the student-index.json that I use in Elasticsearch:
{
"aliases": {},
"warmers": {},
"mappings": {
"tab_students_dfe": {
"properties": {
"stcode": {
"type": "text"
},
"voroodi": {
"type": "integer"
},
"name": {
"type": "text"
},
"family": {
"type": "text"
},
"namp": {
"type": "text"
},
"lastupdate": {
"type": "date"
},
"picture": {
"type": "text"
},
"uniquename": {
"type": "text"
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "1"
}
}
}
Then I try to create this index in Elasticsearch with:
curl -XPUT --header "Content-Type: application/json" http://localhost:9200/students -d @postgres-index.json
Next, this is my configuration file, /usr/share/logstash/bin/students.conf:
input {
jdbc {
jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
jdbc_user => "postgres"
jdbc_password => "postgres"
# The path to downloaded jdbc driver
jdbc_driver_library => "./postgresql-42.2.1.jar"
jdbc_driver_class => "org.postgresql.Driver"
# The path to the file containing the query
statement => "select * from students"
}
}
filter {
aggregate {
task_id => "%{stcode}"
code => "
map['stcode'] = event.get('stcode')
map['voroodi'] = event.get('voroodi')
map['name'] = event.get('name')
map['family'] = event.get('family')
map['namp'] = event.get('namp')
map['uniquename'] = event.get('uniquename')
event.cancel()
"
push_previous_map_as_event => true
timeout => 5
}
}
output {
elasticsearch {
document_id => "%{stcode}"
document_type => "postgres"
index => "students"
codec => "json"
hosts => ["127.0.0.1:9200"]
}
}
Thank you for your help
This is a bit old, but I just had the same issue and increasing the heap size of logstash helped me here. I added this to my logstash service in the docker-compose file:
environment:
  LS_JAVA_OPTS: "-Xmx2048m -Xms2048m"
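For context, a minimal sketch of where that block sits in a docker-compose service definition (the service name and image tag are illustrative, not taken from the question):
services:
  logstash:
    image: docker.elastic.co/logstash/logstash:7.17.0  # illustrative tag
    environment:
      LS_JAVA_OPTS: "-Xmx2048m -Xms2048m"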
Further reading: What are the -Xms and -Xmx parameters when starting JVM?
I'm getting strange behavior from my Elasticsearch cluster: it seems that it no longer sees static Groovy scripts.
It complains that "dynamic scripting is disabled", although I am using a static script and its correct name.
The scripts did work before, and I can't understand what has changed.
Here are the steps I am using to reproduce the problem:
Create an index with a mapping defining one string field and one nested object:
curl -XPUT localhost:9200/test/ -d '{
"index": {
"number_of_shards": 1,
"number_of_replicas": 0
}
}'
curl -XPUT localhost:9200/test/_mapping/testtype -d '{
"testtype": {
"properties": {
"name": {
"type": "string"
},
"features": {
"type": "nested",
"properties": {
"key": {
"type": "string",
"value": {
"type": "string"
}
}
}
}
}
}
}'
response:
{
"acknowledged": true
}
Put a single object there:
curl -XPUT localhost:9200/test/testtype/1 -d '{
"name": "hello",
"features": []
}'
Call update using the script:
curl -XPOST http://localhost:9200/test/testtype/1/_update -d '{
"script": "add-feature-if-not-exists",
"params": {
"new_feature": {
"key": "Manufacturer",
"value": "China"
}
}
}'
response:
{
"error": "RemoteTransportException[[esnew1][inet[/78.46.100.39:9300]][indices:data/write/update]];
nested: ElasticsearchIllegalArgumentException[failed to execute script];
nested: ScriptException[dynamic scripting for [groovy] disabled]; ",
"status": 400
}
Getting "dynamic scripting for [groovy] disabled" - but I am using a reference to static script name in "script" field. However I've seen this message occurring if the name of script was incorrect. But looks like it is correct:
The script is located in /etc/elasticsearch/scripts/ .
Verifying that /etc/elasticsearch is used as a config directory:
ps aux | grep elas
elastic+ 944 0.8 4.0 21523740 1322820 ? Sl 15:35 0:39 /usr/lib/jvm/java-7-oracle/bin/java -Xms16g -Xmx16g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Delasticsearch -Des.pidfile=/var/run/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch -cp :/usr/share/elasticsearch/lib/elasticsearch-1.4.0.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/* -Des.default.config=/etc/elasticsearch/elasticsearch.yml -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.work=/tmp/elasticsearch -Des.default.path.conf=/etc/elasticsearch org.elasticsearch.bootstrap.Elasticsearch
Checking that the scripts are there:
$ ls -l /etc/elasticsearch/
total 24
-rw-r--r-- 1 root root 13683 Nov 25 14:52 elasticsearch.yml
-rw-r--r-- 1 root root 1511 Nov 15 04:13 logging.yml
drwxr-xr-x 2 root root 4096 Nov 25 15:07 scripts
$ ls -l /etc/elasticsearch/scripts/
total 8
-rw-r--r-- 1 elasticsearch elasticsearch 438 Nov 25 15:07 add-feature-if-not-exists.groovy
-rw-r--r-- 1 elasticsearch elasticsearch 506 Nov 23 02:52 add-review-if-not-exists.groovy
Any hints on why this is happening? What else should I check?
Update: the cluster has two nodes.
Config on node 1:
cluster.name: myclustername
node.name: "esnew1"
node.master: true
node.data: true
bootstrap.mlockall: true
network.host: zz.zz.zz.zz
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["esnew1.company.com","esnew2.company.com"]
index.store.type: niofs
script.disable_dynamic: true
script.auto_reload_enabled: true
watcher.interval: 30s
Config on node 2:
cluster.name: myclustername
node.name: "esnew2"
node.master: true
node.data: true
bootstrap.mlockall: true
network.bind_host: 127.0.0.1,zz.zz.zz.zz
network.publish_host: zz.zz.zz.zz
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["esnew1.company.com","esnew2.company.com"]
index.store.type: niofs
script.disable_dynamic: true
script.auto_reload_enabled: true
watcher.interval: 30s
Elasticsearch version:
$ curl localhost:9200
{
"status" : 200,
"name" : "esnew2",
"cluster_name" : "myclustername",
"version" : {
"number" : "1.4.0",
"build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
"build_timestamp" : "2014-11-05T14:26:12Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}
P.S.: One observation confirming that ES probably just doesn't see the scripts: at one point ES was seeing one script but not the other. After a restart it does not see either of them.
P.P.S.: The script:
do_add = false
def stringsEqual(s1, s2) {
if (s1 == null) {
return s2 == null;
}
return s1.equalsIgnoreCase(s2);
}
for (item in ctx._source.features) {
if (stringsEqual(item['key'], new_feature.key) {
if (! stringsEqual(item['value'], new_feature.value) {
item.value = new_feature.value;
}
} else {
do_add = true
}
}
if (do_add) {
ctx._source.features.add(new_feature)
}