Different output when running a command from Java

I have the following script:
#!/bin/bash
ID=$PPID
read PID < <(exec ps -o ppid= "$ID")
top -cbn 1 -p $PID
grep -f <(pstree -cp $PID | grep -Po '\(\K\d+'| sed -re 's/$/ /g' | sed -re 's/^/^\\s\*/g' ) <(top -cbn 1)
When I run this script from the command prompt, the output is:
top - 16:43:17 up 6:40, 6 users, load average: 0.04, 0.02, 0.00
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.9%us, 0.6%sy, 0.0%ni, 98.1%id, 0.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4152800k total, 1908108k used, 2244692k free, 92984k buffers
Swap: 1048568k total, 0k used, 1048568k free, 1104756k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16866 root 40 0 7792 3188 2088 S 0.0 0.1 0:00.01 su
16750 builder 40 0 1213m 262m 15m S 0.0 6.5 0:55.43 /opt/ibm/java-i386-60/jre/bin/java -Dosgi.requiredJavaVersion=1.6 -XX:MaxPermSize=256m -Xms40m -Xmx1024m -jar /home/builder/eclipse//plugins/org.eclipse.equinox.launcher_1.3.0.
Notice that the COMMAND column displays the complete command. But if I run the same script from Java, the COMMAND column is truncated, and I cannot figure out why.
The truncation occurs after 19 characters of the COMMAND column; in other words, every line is cut off at 80 characters.
Here is the output when I run the same script from Java:
top - 16:13:52 up 6:10, 6 users, load average: 0.16, 0.16, 0.06
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.9%us, 0.6%sy, 0.0%ni, 98.0%id, 0.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4152800k total, 1913364k used, 2239436k free, 91560k buffers
Swap: 1048568k total, 0k used, 1048568k free, 1103952k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16750 builder 40 0 1206m 257m 15m S 0.0 6.3 0:26.32 /opt/ibm/java-i386-
16750 builder 40 0 1206m 257m 15m S 1.9 6.3 0:26.33 /opt/ibm/java-i386-
16943 builder 20 0 2608 1008 748 R 1.9 0.0 0:00.02 top -cbn 1
16918 builder 20 0 554m 14m 6060 S 0.0 0.4 0:00.10 /opt/ibm/java-i386-
16934 builder 20 0 4976 1120 992 S 0.0 0.0 0:00.00 /bin/bash /home/bui
16941 builder 20 0 4976 508 376 S 0.0 0.0 0:00.00 /bin/bash /home/bui
16942 builder 20 0 4388 888 608 S 0.0 0.0 0:00.00 grep -f /dev/fd/63
My Java class that runs the script is:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class InformationFetcher {

    public static void main(String[] args) {
        InformationFetcher informationFetcher = new InformationFetcher();
        try {
            // Run the shell script and capture its standard output.
            Process process = Runtime.getRuntime().exec(
                    informationFetcher.getFilePath());
            InputStream in = process.getInputStream();
            printInputStream(in);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Note: topProcessor is not declared in this snippet.
    private void processInformation(InputStream in) {
        topProcessor.process(in);
    }

    // Read the stream line by line and print everything at once.
    private static void printInputStream(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        StringBuilder outBuffer = new StringBuilder();
        String newLine = System.getProperty("line.separator");
        String line;
        while ((line = reader.readLine()) != null) {
            outBuffer.append(line);
            outBuffer.append(newLine);
        }
        System.out.println(outBuffer.toString());
    }

    // Resolve the script's location on the classpath.
    public String getFilePath() {
        return this.getClass().getResource("/idFetcher.sh").getPath();
    }
}
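One possible explanation, offered as an assumption rather than a confirmed diagnosis: when the output of top goes to a pipe instead of a terminal, many procps builds fall back to a width of 80 columns, and they read the COLUMNS environment variable to override it. The sketch below shows how a Java launcher could pass a larger COLUMNS value to the script. The class name WideTopRunner, the width of 512, and the script path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

final class WideTopRunner {

    // Builds a ProcessBuilder whose child process sees COLUMNS=<width>.
    // Assumption: the top build on the target machine honors COLUMNS.
    static ProcessBuilder withColumns(int width, String... command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().put("COLUMNS", Integer.toString(width));
        pb.redirectErrorStream(true); // merge stderr into stdout
        return pb;
    }

    // Starts the process and collects its output line by line.
    static List<String> run(ProcessBuilder pb) throws IOException, InterruptedException {
        Process process = pb.start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical script location; replace with the real path.
        for (String line : run(withColumns(512, "/path/to/idFetcher.sh"))) {
            System.out.println(line);
        }
    }
}
```

If the installed top ignores COLUMNS, this changes nothing, so it is worth checking the local top man page before relying on it.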

Related

Golang and apache AB

I have a system that receives HTTP POST requests and runs on Spring 5 (standalone Tomcat). In short, it looks like this:
client (Apache AB) ----> micro service (java or golang) --> RabbitMQ --> Core(spring + tomcat).
When I use my Java (Spring) service, everything works. AB shows this output:
ab -n 1000 -k -s 2 -c 10 -s 60 -p test2.sh -A 113:113 -T 'application/json' https://127.0.0.1:8449/SecureChat/chat/v1/rest-message/send
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
...
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 8449
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /rest-message/send
Document Length: 39 bytes
Concurrency Level: 10
Time taken for tests: 434.853 seconds
Complete requests: 1000
Failed requests: 0
Keep-Alive requests: 0
Total transferred: 498000 bytes
Total body sent: 393000
HTML transferred: 39000 bytes
Requests per second: 2.30 [#/sec] (mean)
Time per request: 4348.528 [ms] (mean)
Time per request: 434.853 [ms] (mean, across all concurrent requests)
Transfer rate: 1.12 [Kbytes/sec] received
0.88 kb/s sent
2.00 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 4 14 7.6 17 53
Processing: 1110 4317 437.2 4285 8383
Waiting: 1107 4314 437.2 4282 8377
Total: 1126 4332 436.8 4300 8403
That run goes through TLS.
But when I use my Golang service, I get a timeout:
Benchmarking 127.0.0.1 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 92 requests completed
And this output:
ab -n 100 -k -s 2 -c 10 -s 60 -p test2.sh -T 'application/json' http://127.0.0.1:8089/
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)...^C
Server Software:
Server Hostname: 127.0.0.1
Server Port: 8089
Document Path: /
Document Length: 39 bytes
Concurrency Level: 10
Time taken for tests: 145.734 seconds
Complete requests: 92
Failed requests: 1
(Connect: 0, Receive: 0, Length: 1, Exceptions: 0)
Keep-Alive requests: 91
Total transferred: 16380 bytes
Total body sent: 32200
HTML transferred: 3549 bytes
Requests per second: 0.63 [#/sec] (mean)
Time per request: 15840.663 [ms] (mean)
Time per request: 1584.066 [ms] (mean, across all concurrent requests)
Transfer rate: 0.11 [Kbytes/sec] received
0.22 kb/s sent
0.33 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 1229 1494 1955.9 1262 20000
Waiting: 1229 1291 143.8 1262 2212
Total: 1229 1494 1955.9 1262 20000
That run goes over plain TCP.
I guess there are mistakes in my code, which is all in one file:
func initAmqp(rabbitUrl string) {
var err error
conn, err = amqp.Dial(rabbitUrl)
failOnError(err, "Failed to connect to RabbitMQ")
}
func main() {
err := gcfg.ReadFileInto(&cfg, "config.gcfg")
if err != nil {
log.Fatal(err);
}
PrintConfig(cfg)
if cfg.Section_rabbit.RabbitUrl != "" {
initAmqp(cfg.Section_rabbit.RabbitUrl);
}
mux := http.NewServeMux();
mux.Handle("/", NewLimitHandler(1000, newTestHandler()))
server := http.Server {
Addr: cfg.Section_basic.Port,
Handler: mux,
ReadTimeout: 20 * time.Second,
WriteTimeout: 20 * time.Second,
}
defer conn.Close();
log.Println(server.ListenAndServe());
}
func NewLimitHandler(maxConns int, handler http.Handler) http.Handler {
h := &limitHandler{
connc: make(chan struct{}, maxConns),
handler: handler,
}
for i := 0; i < maxConns; i++ {
h.connc <- struct{}{}
}
return h
}
func newTestHandler() http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
handler(w, r);
})
}
func handler(w http.ResponseWriter, r *http.Request) {
if b, err := ioutil.ReadAll(r.Body); err == nil {
fmt.Println("message is ", string(b));
res := publishMessages(string(b))
w.Write([]byte(res))
w.WriteHeader(http.StatusOK)
counter ++;
}else {
w.WriteHeader(http.StatusInternalServerError)
w.Write([]byte("500 - Something bad happened!"))
}
}
func publishMessages(payload string) string {
ch, err := conn.Channel()
failOnError(err, "Failed to open a channel")
q, err = ch.QueueDeclare(
"", // name
false, // durable
false, // delete when unused
true, // exclusive
false, // noWait
nil, // arguments
)
failOnError(err, "Failed to declare a queue")
msgs, err := ch.Consume(
q.Name, // queue
"", // consumer
true, // auto-ack
false, // exclusive
false, // no-local
false, // no-wait
nil, // args
)
failOnError(err, "Failed to register a consumer")
corrId := randomString(32)
log.Println("corrId ", corrId)
err = ch.Publish(
"", // exchange
cfg.Section_rabbit.RabbitQeue, // routing key
false, // mandatory
false, // immediate
amqp.Publishing{
DeliveryMode: amqp.Transient,
ContentType: "application/json",
CorrelationId: corrId,
Body: []byte(payload),
Timestamp: time.Now(),
ReplyTo: q.Name,
})
failOnError(err, "Failed to Publish on RabbitMQ")
defer ch.Close();
result := "";
for d := range msgs {
if corrId == d.CorrelationId {
failOnError(err, "Failed to convert body to integer")
log.Println("result = ", string(d.Body))
return string(d.Body);
}else {
log.Println("waiting for result = ")
}
}
return result;
}
Can someone help?
EDIT
Here are my variables:
type limitHandler struct {
connc chan struct{}
handler http.Handler
}
var conn *amqp.Connection
var q amqp.Queue
EDIT 2
func (h *limitHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
select {
case <-h.connc:
fmt.Println("ServeHTTP");
h.handler.ServeHTTP(w, req)
h.connc <- struct{}{}
default:
http.Error(w, "503 too busy", http.StatusServiceUnavailable)
}
}
EDIT 3
func failOnError(err error, msg string) {
if err != nil {
log.Fatalf("%s: %s", msg, err)
panic(fmt.Sprintf("%s: %s", msg, err))
}
}

Finding substring from a string using regex java

I have a String:
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
I want to extract the path (in this case D:\\workdir\\PV 81\\config\\sum81pv.pwf) from this string. The path is the argument of the -sn or -n option, so it always appears after one of these options.
The path may or may not contain whitespace, which needs to be handled.
public class TestClass {
public static void main(String[] args) {
String path;
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
path = s.replaceAll(".*(-sn|-n) \"?([^ ]*)?", "$2");
System.out.println("Path: " + path);
}
}
Current output: Path: D:\workdir\PV 81\config\sum81pv.pwf -C 5000
Expected output: Path: D:\workdir\PV 81\config\sum81pv.pwf
The answers below work fine for the earlier case.
But for the cases below, what regex would find the password file?
String s1 = msqllab91 0 0 1 50 50 60 /mti/root/bin/msqlora -n "tmp/my.pwf" -s
String s2 = msqllab92 0 0 1 50 50 60 /mti/root/bin/msqlora -s -n /mti/root/my.pwf
String s3 = msqllab93 0 0 1 50 50 60 msqlora -s -n "/mti/root/my.pwf" -C 10000
String s4 = msqllab94 0 0 1 50 50 60 msqlora.exe -sn /mti/root/my.pwf
String s5 = msqllab95 0 0 1 50 50 60 msqlora.exe -sn "/mti/root"/my.pwf
String s6 = msqllab96 0 0 1 50 50 60 msqlora.exe -sn"/mti/root"/my.pwf
String s7 = msqllab97 0 0 1 50 50 60 "/mti/root/bin/msqlora" -s -n /mti/root/my.pwf -s
String s8 = msqllab98 0 0 1 50 50 60 /mti/root/bin/msqlora -s
String s9 = msqllab99 0 0 1 50 50 60 /mti/root/bin/msqlora -s -n /mti/root/my.NOTpwf -s -n /mti/root/my.pwf
String s10 = msqllab90 0 0 1 50 50 60 /mti/root/bin/msqlora -sn /mti/root/my.NOTpwf -sn /mti/root/my.pwf
String s11 = msqllab901 0 0 1 50 50 60 /mti/root/bin/msqlora
String s12 = msqllab902 0 0 1 50 50 60 /mti/root/msqlora-n NOTmy.pwf
String s13 = msqllab903 0 0 1 50 50 60 /mti/root/msqlora-n.exe NOTmy.pwf
I need a regex that returns the *.pwf path whether the option is -sn, -n, -s, -s -n, or neither -s nor -n is present.
The path must have the .pwf file extension only, not NOTpwf or any other extension, and the code should work for all cases except the last two, because they are invalid commands.
Note: I already asked a question of this type but didn't get anything that works for my requirement. (How to get specific substring with option vale using java)
You can use:
path = s.replaceFirst(".*\\s-s?n\\s*(.+?)(?:\\s-.*|$)", "$1");
//=> D:\workdir\PV 81\config\sum81pv.pwf
Try this:
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
int l = s.indexOf("-sn");
int l1 = s.indexOf("-C");
// Skip "-sn " (4 characters) and stop before the space that precedes "-C".
System.out.println(s.substring(l + 4, l1 - 1));
You can also use : [A-Z]:.*\.\w+
Rather than using complex regexps for replacing, I'd rather suggest a simpler one for matching:
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
Pattern pattern = Pattern.compile("\\s-s?n\\s*(.*?)\\s*-C\\s+\\d+$");
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
System.out.println(matcher.group(1));
}
// => D:\workdir\PV 81\config\sum81pv.pwf
If the -C <NUMBER> is optional at the end, wrap with an optional group -> (?:\\s*-C\\s+\\d+)?$.
Pattern details:
\\s - a whitespace
-s?n - a -sn or -n (as s? matches an optional s)
\\s* - 0+ whitespaces
(.*?) - Group 1 matching any 0+ chars other than a newline
\\s* - 0+ whitespaces (same as above)
-C - a literal -C
\\s+ - 1+ whitespaces
\\d+ - 1 or more digits
$ - end of string.
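The matching approach above, with the optional -C group folded in, can be sketched as a small helper. The class name PwfPathExtractor and the method name extractPath are hypothetical, and this sketch only covers the earlier case (a path after -sn or -n, optionally followed by -C <NUMBER>), not the extended s1 to s13 cases from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PwfPathExtractor {

    // " -sn" or " -n", optional spaces, then the path (lazy),
    // optionally followed by "-C <digits>" at the end of the string.
    private static final Pattern PATH_AFTER_OPTION =
            Pattern.compile("\\s-s?n\\s*(.*?)(?:\\s*-C\\s+\\d+)?$");

    // Returns the captured path, or null when no -sn / -n option is present.
    static String extractPath(String commandLine) {
        Matcher matcher = PATH_AFTER_OPTION.matcher(commandLine);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
        System.out.println(extractPath(s));
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call, which matters if many command lines are processed.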

100% CPU usage by Java

I'm facing an issue that occurs randomly and causes a 100% CPU usage. I've found the PID of the thread which is actually using CPU.
Main PID: 22777
Thread PID: 22793
From top -H -u user
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22793 user 25 0 2640m 16m 14m R 98.8 0.4 5134:31 java
22480 user 25 0 7920 364 360 S 0.0 0.0 0:00.00 cat
22777 user 18 0 2640m 16m 14m S 0.0 0.4 0:00.00 java
22779 user 18 0 2640m 16m 14m S 0.0 0.4 0:03.34 java
22780 user 15 0 2640m 16m 14m S 0.0 0.4 0:46.76 java
22781 user 15 0 2640m 16m 14m S 0.0 0.4 0:00.49 java
{...}
From jstack -J-d64 -m 22777
{...}
----------------- 22793 -----------------
0x00002b9edcd4c5a0 _ZN12Dependencies25find_finalizable_subclassEP5Klass + 0x150
0x00002b9edcc5a8ee _ZN15ciInstanceKlass24has_finalizable_subclassEv + 0xbe
0x00002b9edcb9f83e _ZN12GraphBuilder23call_register_finalizerEv + 0x9e
0x00002b9edcba62a5 _ZN12GraphBuilder13method_returnEP11Instruction + 0x295
0x00002b9edcbac85f _ZN12GraphBuilder27iterate_bytecodes_for_blockEi + 0x6cf
0x00002b9edcba9c4b _ZN12GraphBuilder18iterate_all_blocksEb + 0x14b
0x00002b9edcbaa5e6 _ZN12GraphBuilder15try_inline_fullEP8ciMethodbN9Bytecodes4CodeEP11Instruction + 0x996
0x00002b9edcbaa7df _ZN12GraphBuilder10try_inlineEP8ciMethodbN9Bytecodes4CodeEP11Instruction + 0x11f
0x00002b9edcbab912 _ZN12GraphBuilder6invokeEN9Bytecodes4CodeE + 0xbb2
0x00002b9edcbac83d _ZN12GraphBuilder27iterate_bytecodes_for_blockEi + 0x6ad
0x00002b9edcba9c4b _ZN12GraphBuilder18iterate_all_blocksEb + 0x14b
0x00002b9edcbaa5e6 _ZN12GraphBuilder15try_inline_fullEP8ciMethodbN9Bytecodes4CodeEP11Instruction + 0x996
0x00002b9edcbaa7df _ZN12GraphBuilder10try_inlineEP8ciMethodbN9Bytecodes4CodeEP11Instruction + 0x11f
0x00002b9edcbab912 _ZN12GraphBuilder6invokeEN9Bytecodes4CodeE + 0xbb2
0x00002b9edcbac83d _ZN12GraphBuilder27iterate_bytecodes_for_blockEi + 0x6ad
0x00002b9edcba9bf2 _ZN12GraphBuilder18iterate_all_blocksEb + 0xf2
0x00002b9edcbae7a7 _ZN12GraphBuilderC1EP11CompilationP7IRScope + 0x527
0x00002b9edcbb7127 _ZN7IRScopeC1EP11CompilationPS_iP8ciMethodib + 0x1e7
0x00002b9edcbb723f _ZN2IRC1EP11CompilationP8ciMethodi + 0x9f
0x00002b9edcb9625b _ZN11Compilation9build_hirEv + 0xdb
0x00002b9edcb9661e _ZN11Compilation19compile_java_methodEv + 0x6e
0x00002b9edcb9674e _ZN11Compilation14compile_methodEv + 0x4e
0x00002b9edcb96abe _ZN11CompilationC1EP16AbstractCompilerP5ciEnvP8ciMethodiP10BufferBlob + 0x25e
0x00002b9edcb97869 _ZN8Compiler14compile_methodEP5ciEnvP8ciMethodi + 0xa9
0x00002b9edccea43a _ZN13CompileBroker25invoke_compiler_on_methodEP11CompileTask + 0xc9a
0x00002b9edcceb3e6 _ZN13CompileBroker20compiler_thread_loopEv + 0x5d6
0x00002b9edd29ebcf _ZN10JavaThread17thread_main_innerEv + 0xdf
0x00002b9edd29ecfc _ZN10JavaThread3runEv + 0x11c
0x00002b9edd153048 _ZL10java_startP6Thread + 0x108
{...}
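It looks like a JVM bug in the JIT compiler: the busy thread is a compiler thread (CompileBroker::compiler_thread_loop) that appears to be stuck while compiling a method.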
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)
CentOS 5.5
Kernel: 2.6.18-194.el5
Other processes aren't affected, but the Java process is completely unresponsive. It also occurs on other CentOS servers, but not on Oracle Linux servers.
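The top -H listing reports thread ids in decimal, whereas a Java-level jstack dump (run without -m) labels each thread with nid=0x... in hexadecimal. A small helper, here given the hypothetical name ThreadIdFormat, can convert the id so the busy thread is easier to locate in such a dump:

```java
final class ThreadIdFormat {

    // Converts a decimal thread id from top -H into the hex form used by nid=.
    static String toNid(int lwpId) {
        return "0x" + Integer.toHexString(lwpId);
    }

    public static void main(String[] args) {
        // 22793 is the busy thread id from the top output in the question.
        System.out.println(toNid(22793));
    }
}
```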

Implement a java UDF and call it from pyspark

I need to create a UDF for use in PySpark that relies on a Java object for its internal calculations.
If it were plain Python, I would do something like:
def f(x):
    return 7
fudf = pyspark.sql.functions.udf(f, pyspark.sql.types.IntegerType())
and call it using:
df = sqlContext.range(0,5)
df2 = df.withColumn("a",fudf(df.id)).show()
However, the function I need is implemented in Java, not Python. I need to wrap it somehow so that I can call it in a similar way from Python.
My first attempt was to implement the Java object, wrap it in Python in PySpark, and convert that to a UDF. That failed with a serialization error.
Java code:
package com.test1.test2;

public class TestClass1 {
    Integer internalVal;

    public TestClass1(Integer val1) {
        internalVal = val1;
    }

    public Integer do_something(Integer val) {
        return internalVal;
    }
}
pyspark code:
from py4j.java_gateway import java_import
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
java_import(sc._gateway.jvm, "com.test1.test2.TestClass1")
a = sc._gateway.jvm.com.test1.test2.TestClass1(7)
audf = udf(a,IntegerType())
error:
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
<ipython-input-2-9756772ab14f> in <module>()
4 java_import(sc._gateway.jvm, "com.test1.test2.TestClass1")
5 a = sc._gateway.jvm.com.test1.test2.TestClass1(7)
----> 6 audf = udf(a,IntegerType())
/usr/local/spark/python/pyspark/sql/functions.py in udf(f, returnType)
1595 [Row(slen=5), Row(slen=3)]
1596 """
-> 1597 return UserDefinedFunction(f, returnType)
1598
1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
/usr/local/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, name)
1556 self.returnType = returnType
1557 self._broadcast = None
-> 1558 self._judf = self._create_judf(name)
1559
1560 def _create_judf(self, name):
/usr/local/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
1565 command = (func, None, ser, ser)
1566 sc = SparkContext.getOrCreate()
-> 1567 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self)
1568 ctx = SQLContext.getOrCreate(sc)
1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
/usr/local/spark/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command, obj)
2297 # the serialized command will be compressed by broadcast
2298 ser = CloudPickleSerializer()
-> 2299 pickled_command = ser.dumps(command)
2300 if len(pickled_command) > (1 << 20): # 1M
2301 # The broadcast will have same life cycle as created PythonRDD
/usr/local/spark/python/pyspark/serializers.py in dumps(self, obj)
426
427 def dumps(self, obj):
--> 428 return cloudpickle.dumps(obj, 2)
429
430
/usr/local/spark/python/pyspark/cloudpickle.py in dumps(obj, protocol)
644
645 cp = CloudPickler(file,protocol)
--> 646 cp.dump(obj)
647
648 return file.getvalue()
/usr/local/spark/python/pyspark/cloudpickle.py in dump(self, obj)
105 self.inject_addons()
106 try:
--> 107 return Pickler.dump(self, obj)
108 except RuntimeError as e:
109 if 'recursion' in e.args[0]:
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in dump(self, obj)
222 if self.proto >= 2:
223 self.write(PROTO + chr(self.proto))
--> 224 self.save(obj)
225 self.write(STOP)
226
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in save_tuple(self, obj)
566 write(MARK)
567 for element in obj:
--> 568 save(element)
569
570 if id(obj) in memo:
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288
/usr/local/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
191 if islambda(obj) or obj.__code__.co_filename == '<stdin>' or themodule is None:
192 #print("save global", islambda(obj), obj.__code__.co_filename, modname, themodule)
--> 193 self.save_function_tuple(obj)
194 return
195 else:
/usr/local/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
234 # create a skeleton function object and memoize it
235 save(_make_skel_func)
--> 236 save((code, closure, base_globals))
237 write(pickle.REDUCE)
238 self.memoize(func)
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in save_tuple(self, obj)
552 if n <= 3 and proto >= 2:
553 for element in obj:
--> 554 save(element)
555 # Subtle. Same as in the big comment below.
556 if id(obj) in memo:
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in save_list(self, obj)
604
605 self.memoize(obj)
--> 606 self._batch_appends(iter(obj))
607
608 dispatch[ListType] = save_list
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in _batch_appends(self, items)
637 write(MARK)
638 for x in tmp:
--> 639 save(x)
640 write(APPENDS)
641 elif n:
/home/mendea3/anaconda2/lib/python2.7/pickle.pyc in save(self, obj)
304 reduce = getattr(obj, "__reduce_ex__", None)
305 if reduce:
--> 306 rv = reduce(self.proto)
307 else:
308 reduce = getattr(obj, "__reduce__", None)
/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JError(
311 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
--> 312 format(target_id, ".", name, value))
313 else:
314 raise Py4JError(
Py4JError: An error occurred while calling o18.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
EDIT: I also tried making the Java class serializable, but to no avail.
My second attempt was to define the UDF in Java to begin with, but that failed because I am not sure how to wrap it correctly:
java code:
package com.test1.test2;

import org.apache.spark.sql.api.java.UDF1;

public class TestClassUdf implements UDF1<Integer, Integer> {
    Integer retval;

    public TestClassUdf(Integer val) {
        retval = val;
    }

    @Override
    public Integer call(Integer arg0) throws Exception {
        return retval;
    }
}
but how would I use it?
I tried:
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm, "com.test1.test2.TestClassUdf")
a = sc._gateway.jvm.com.test1.test2.TestClassUdf(7)
dfint = sqlContext.range(0,15)
df = dfint.withColumn("a",a(dfint.id))
but I get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-514811090b5f> in <module>()
3 a = sc._gateway.jvm.com.test1.test2.TestClassUdf(7)
4 dfint = sqlContext.range(0,15)
----> 5 df = dfint.withColumn("a",a(dfint.id))
TypeError: 'JavaObject' object is not callable
and I tried to use a.call instead of a:
df = dfint.withColumn("a",a.call(dfint.id))
but got:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
3 a = sc._gateway.jvm.com.test1.test2.TestClassUdf(7)
4 dfint = sqlContext.range(0,15)
----> 5 df = dfint.withColumn("a",a.call(dfint.id))
/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
796 def __call__(self, *args):
797 if self.converters is not None and len(self.converters) > 0:
--> 798 (new_args, temp_args) = self._get_args(args)
799 else:
800 new_args = args
/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in _get_args(self, args)
783 for converter in self.gateway_client.converters:
784 if converter.can_convert(arg):
--> 785 temp_arg = converter.convert(arg, self.gateway_client)
786 temp_args.append(temp_arg)
787 new_args.append(temp_arg)
/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_collections.py in convert(self, object, gateway_client)
510 HashMap = JavaClass("java.util.HashMap", gateway_client)
511 java_map = HashMap()
--> 512 for key in object.keys():
513 java_map[key] = object[key]
514 return java_map
TypeError: 'Column' object is not callable
Any help would be appreciated.
I got this working with the help of another question (and answer) of your own about UDAFs.
Spark provides a udf() method for wrapping Scala FunctionN, so we can wrap the Java function in Scala and use that. Your Java method needs to be static or on a class that implements Serializable.
package com.example
import org.apache.spark.sql.UserDefinedFunction
import org.apache.spark.sql.functions.udf
class MyUdf extends Serializable {
def getUdf: UserDefinedFunction = udf(() => MyJavaClass.MyJavaMethod())
}
Usage in PySpark:
def my_udf():
    from pyspark.sql.column import Column, _to_java_column, _to_seq
    pcls = "com.example.MyUdf"
    jc = sc._jvm.java.lang.Thread.currentThread() \
        .getContextClassLoader().loadClass(pcls).newInstance().getUdf().apply
    return Column(jc(_to_seq(sc, [], _to_java_column)))

rdd1 = sc.parallelize([{'c1': 'a'}, {'c1': 'b'}, {'c1': 'c'}])
df1 = rdd1.toDF()
df2 = df1.withColumn('mycol', my_udf())
As with the UDAF in your other question and answer, we can pass columns into it with return Column(jc(_to_seq(sc, ["col1", "col2"], _to_java_column)))
In line with https://dzone.com/articles/pyspark-java-udf-integration-1, you could define a UDF1 in Java:
package com.example.spark;

import org.apache.spark.sql.api.java.UDF1;

public class AddNumber implements UDF1<Long, Long> {
    @Override
    public Long call(Long num) throws Exception {
        return (num + 5);
    }
}
Then, after adding the jar to your PySpark session with --jars <your-jar>,
you can use it in PySpark as:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import FloatType, LongType
>>> df = spark.createDataFrame([float(i) for i in range(100)], FloatType()).toDF("a")
>>> spark.udf.registerJavaFunction("addNumber", "com.example.spark.AddNumber", LongType())
>>> df.withColumn("b", F.expr("addNumber(a)")).show(5)
+---+---+
| a| b|
+---+---+
|0.0| 5|
|1.0| 6|
|2.0| 7|
|3.0| 8|
|4.0| 9|
+---+---+
only showing top 5 rows

Comparing lines in a file

I am trying to compare File 1 and File 2.
File 1:
7.3 0.28 0.36 12.7 0.04 38 140 0.998 3.3 0.79 9.6 6 1
7.4 0.33 0.26 15.6 0.049 67 210 0.99907 3.06 0.68 9.5 5 1
7.3 0.25 0.39 6.4 0.034 8 84 0.9942 3.18 0.46 11.5 5 1
6.9 0.38 0.25 9.8 0.04 28 191 0.9971 3.28 0.61 9.2 5 1
5.1 0.11 0.32 1.6 0.028 12 90 0.99008 3.57 0.52 12.2 6 1
File 2:
5.1 0.11 0.32 1.6 0.028 12 90 0.99008 3.57 0.52 12.2 6 -1
7.3 0.25 0.39 6.4 0.034 8 84 0.9942 3.18 0.46 11.5 5 1
6.9 0.38 0.25 9.8 0.04 28 191 0.9971 3.28 0.61 9.2 5 -1
7.4 0.33 0.26 15.6 0.049 67 210 0.99907 3.06 0.68 9.5 5 -1
7.3 0.28 0.36 12.7 0.04 38 140 0.998 3.3 0.79 9.6 6 1
In both files, the last element of each line is the class label.
I want to check whether the class labels are equal,
i.e. compare the class label of
line1:7.3 0.28 0.36 12.7 0.04 38 140 0.998 3.3 0.79 9.6 6 1
with
line2:7.3 0.28 0.36 12.7 0.04 38 140 0.998 3.3 0.79 9.6 6 1
This is a match.
Then compare
line1:7.4 0.33 0.26 15.6 0.049 67 210 0.99907 3.06 0.68 9.5 5 1
with
line2:7.4 0.33 0.26 15.6 0.049 67 210 0.99907 3.06 0.68 9.5 5 -1
This is not a match.
Updated
What I did is:
String line1;
String line2;
int notequalcnt = 0;
while ((line1 = bfpart.readLine()) != null) {
    found = false;
    while ((line2 = bfin.readLine()) != null) {
        if (line1.equals(line2)) {
            found = true;
            break;
        }
        else {
            System.out.println("not equal");
            notequalcnt++;
        }
    }
}
But every line is reported as not equal.
Am I doing something wrong?
After the first pass of the outer loop, the reader for file 2 is exhausted, so readLine() returns null and the inner loop never runs again. Open the file 2 reader inside the outer loop instead. Use this code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CompareFile {
    public static void main(String[] args) throws IOException {
        String line1;
        String line2;
        boolean found;
        int notequalcnt = 0;
        BufferedReader bfpart = new BufferedReader(new FileReader("file1.txt"));
        while ((line1 = bfpart.readLine()) != null) {
            found = false;
            // Reopen file 2 for every line of file 1 so the search starts from the top.
            BufferedReader bfin = new BufferedReader(new FileReader("file2.txt"));
            while ((line2 = bfin.readLine()) != null) {
                System.out.println("line1" + line1);
                System.out.println("line2" + line2);
                if (line1.equals(line2)) {
                    System.out.println("equal");
                    found = true;
                    break;
                }
                else {
                    System.out.println("not equal");
                }
            }
            bfin.close();
            // Count a line of file 1 only once, after all of file 2 has been checked.
            if (!found)
                notequalcnt++;
        }
        bfpart.close();
    }
}
You're comparing every line from file 1 with every line from file 2, and you are printing "not equal" every time any one of them doesn't match.
If file 2 has 6 lines, and you are looking for a given line from file 1 (say it's also in file 2), then 5 of the lines from file 2 won't match, and "not equal" will be output 5 times.
Your current implementation says "if any lines in file 2 don't match, it's not a match", but what you really mean is "if any lines in file 2 do match, it is a match". So your logic (pseudocode) should be more like this:
for each line in file 1 {
found = false
reset file 2 to beginning
for each line in file 2
if line 1 equals line 2
found = true, break.
if found
"found!"
else
"not found!"
}
Also you describe this as comparing "nth line of file 1 with nth line of file 2", but that's not actually what your implementation does. Your implementation is actually comparing the first line of file 1 with every line of file 2 then stopping, because you've already consumed every line of file 2 in that inner loop.
Your code has a lot of problems, and you probably need to sit back and work out your logic on paper first.
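If the goal is the one described in the question (find the line with the same feature values in the other file, then check whether the class labels agree), a lookup table avoids rereading file 2 for every line. The sketch below is one interpretation of that goal, not code from any of the answers: it assumes that everything before the last space is the feature part, and the names LabelComparer, splitLine and countLabelMismatches are hypothetical. It works on lists of lines so it can be fed from any reader.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LabelComparer {

    // Splits "f1 f2 ... fn label" into {features, label} at the last space.
    static String[] splitLine(String line) {
        String trimmed = line.trim();
        int lastSpace = trimmed.lastIndexOf(' ');
        if (lastSpace < 0) {
            return new String[] { trimmed, "" }; // no label present
        }
        return new String[] { trimmed.substring(0, lastSpace), trimmed.substring(lastSpace + 1) };
    }

    // Counts lines of file 1 whose label differs from the label of the
    // line with the same features in file 2 (a missing line counts too).
    static int countLabelMismatches(List<String> file1, List<String> file2) {
        Map<String, String> labelByFeatures = new HashMap<>();
        for (String line : file2) {
            String[] parts = splitLine(line);
            labelByFeatures.put(parts[0], parts[1]);
        }
        int mismatches = 0;
        for (String line : file1) {
            String[] parts = splitLine(line);
            String otherLabel = labelByFeatures.get(parts[0]);
            if (otherLabel == null || !otherLabel.equals(parts[1])) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList(
                "7.3 0.28 0.36 12.7 0.04 38 140 0.998 3.3 0.79 9.6 6 1",
                "7.4 0.33 0.26 15.6 0.049 67 210 0.99907 3.06 0.68 9.5 5 1");
        List<String> file2 = Arrays.asList(
                "7.4 0.33 0.26 15.6 0.049 67 210 0.99907 3.06 0.68 9.5 5 -1",
                "7.3 0.28 0.36 12.7 0.04 38 140 0.998 3.3 0.79 9.6 6 1");
        System.out.println(countLabelMismatches(file1, file2));
    }
}
```

The map lookup makes each comparison constant time, so the whole check is linear in the number of lines instead of quadratic.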
If the goal is to compare the files and find the matching lines, read each file into an ArrayList and compare the values.
// Requires: import java.io.File; import java.util.ArrayList; import java.util.Scanner;
Scanner s = new Scanner(new File("file1.txt"));
ArrayList<String> file1_list = new ArrayList<String>();
// Use hasNextLine()/nextLine() so that whole lines, not single tokens, are stored.
while (s.hasNextLine()) {
    file1_list.add(s.nextLine());
}
s.close();
s = new Scanner(new File("file2.txt"));
ArrayList<String> file2_list = new ArrayList<String>();
while (s.hasNextLine()) {
    file2_list.add(s.nextLine());
}
s.close();
for (String line1 : file1_list) {
    if (file2_list.contains(line1)) {
        // found the line
    } else {
        // line not found
    }
}
You can also check Apache Commons IO FileUtils to compare files:
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html
