Transferring data structure from R to Java - java

I have an R script that does some computation. The last step of the computation is a kernel density estimate: http://www.inside-r.org/packages/cran/kerdiest/docs/kde
I now, in R, need to convert the result of calling kde into a string, or save it into a file, such that I can read and "unmarshal" it from a Java program.
What is the best format to use for the exchange and what R and Java libraries can read / write that format?
The structure is not ridiculously complex, but also not trivial:
> str(tmp)
List of 8
$ x : num [1:1398, 1:3] 1.035 0.902 0.679 0.826 1.243 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "Rb ppm" "Sb ppm" "Cr ppm"
$ eval.points:'data.frame': 1398 obs. of 3 variables:
..$ Rb ppm: num [1:1398] 1.035 0.902 0.679 0.826 1.243 ...
..$ Sb ppm: num [1:1398] -2.58 -2.6 -2.48 -2.44 -2.53 ...
..$ Cr ppm: num [1:1398] 4.56 4.44 4.3 4.26 4.49 ...
$ estimate : Named num [1:1398] 0.1572 0.0897 0.0311 0.0434 0.099 ...
..- attr(*, "names")= chr [1:1398] "1" "2" "3" "4" ...
$ H : num [1:3, 1:3] 0.02395 0.00927 -0.014 0.00927 0.06868 ...
$ gridded : logi FALSE
$ binned : logi FALSE
$ names : chr [1:3] "Rb ppm" "Sb ppm" "Cr ppm"
$ w : num [1:1398] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "class")= chr "kde"

RJSONIO seems to do the job. It seems quite verbose, however.

Related

Accessing IMDB dataset in AWS in R, Python or Java

I am trying to connect to the IMDB dataset in AWS.
I've already signed up for AWS and set up the credentials.
I'm more familiar with R, and apparently there's an R package called aws.s3. When I use the s3HTTP function, I get the error below:
s3HTTP(verb="GET", bucket="imdb-datasets", path="documents/v1/current/name.basics.tsv.gz",
request_body = "documents/v1/current/name.basics.tsv.gz",
headers=list('x-amz-request-payer' = "requester"),
key=Sys.setenv("AWS_ACCESS_KEY_ID"="*******"), secret=Sys.setenv("AWS_SECRET_KEY"="******"))
List of 5
$ Code : chr "InvalidAccessKeyId"
$ Message : chr "The AWS Access Key Id you provided does not exist in our records."
$ AWSAccessKeyId: chr "TRUE"
$ RequestId : chr "234D5ED951AD2468"
$ HostId : chr "ugVtbV2Qz6NrNFD7ODO84MnzYttftsjHwbAawExo75Bg9xq3JAXOuDqF8GcYLd5vD6TgcHe/ib4="
- attr(*, "headers")=List of 6
..$ x-amz-request-id : chr "234D5ED951AD2468"
..$ x-amz-id-2 : chr "ugVtbV2Qz6NrNFD7ODO84MnzYttftsjHwbAawExo75Bg9xq3JAXOuDqF8GcYLd5vD6TgcHe/ib4="
..$ content-type : chr "application/xml"
..$ transfer-encoding: chr "chunked"
..$ date : chr "Mon, 20 Nov 2017 08:37:13 GMT"
..$ server : chr "AmazonS3"
..- attr(*, "class")= chr [1:2] "insensitive" "list"
- attr(*, "class")= chr "aws_error"
- attr(*, "request_canonical")= chr "GET\n/imdb-datasets/\nlocation=\nhost:s3.amazonaws.com\nx-amz-date:20171120T083712Z\n\nhost;x-amz-date\ne3b0c44"| __truncated__
- attr(*, "request_string_to_sign")= chr "AWS4-HMAC-SHA256\n20171120T083712Z\n20171120/us-east-1/s3/aws4_request\n760638139d8fa8fa1e36b824f481abe59184955"| __truncated__
- attr(*, "request_signature")= chr "AWS4-HMAC-SHA256 Credential=TRUE/20171120/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-date, Signature=b"| __truncated__
NULL
My access key is up to date and I have no problem accessing my own bucket.
I also copied the Java example code provided by IMDB on their webpage (http://www.imdb.com/interfaces/). It seemed to compile without errors, but no file was downloaded to my bucket in AWS.

Finding substring from a string using regex java

I have a String:
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
I want to get the path (in this case D:\\workdir\\PV 81\\config\\sum81pv.pwf) from this string. This path is an argument of a command option -sn or -n, so this path always appears after these options.
The path may or may not contain whitespaces, which needs to be handled.
public class TestClass {
    public static void main(String[] args) {
        String path;
        String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
        path = s.replaceAll(".*(-sn|-n) \"?([^ ]*)?", "$2");
        System.out.println("Path: " + path);
    }
}
Current output: Path: D:\workdir\PV 81\config\sum81pv.pwf -C 5000
Expected output: Path: D:\workdir\PV 81\config\sum81pv.pwf
The answers below work fine for the earlier case.
But what regex would find the password file in the cases below?
String s1 = "msqllab91 0 0 1 50 50 60 /mti/root/bin/msqlora -n \"tmp/my.pwf\" -s";
String s2 = "msqllab92 0 0 1 50 50 60 /mti/root/bin/msqlora -s -n /mti/root/my.pwf";
String s3 = "msqllab93 0 0 1 50 50 60 msqlora -s -n \"/mti/root/my.pwf\" -C 10000";
String s4 = "msqllab94 0 0 1 50 50 60 msqlora.exe -sn /mti/root/my.pwf";
String s5 = "msqllab95 0 0 1 50 50 60 msqlora.exe -sn \"/mti/root\"/my.pwf";
String s6 = "msqllab96 0 0 1 50 50 60 msqlora.exe -sn\"/mti/root\"/my.pwf";
String s7 = "msqllab97 0 0 1 50 50 60 \"/mti/root/bin/msqlora\" -s -n /mti/root/my.pwf -s";
String s8 = "msqllab98 0 0 1 50 50 60 /mti/root/bin/msqlora -s";
String s9 = "msqllab99 0 0 1 50 50 60 /mti/root/bin/msqlora -s -n /mti/root/my.NOTpwf -s -n /mti/root/my.pwf";
String s10 = "msqllab90 0 0 1 50 50 60 /mti/root/bin/msqlora -sn /mti/root/my.NOTpwf -sn /mti/root/my.pwf";
String s11 = "msqllab901 0 0 1 50 50 60 /mti/root/bin/msqlora";
String s12 = "msqllab902 0 0 1 50 50 60 /mti/root/msqlora-n NOTmy.pwf";
String s13 = "msqllab903 0 0 1 50 50 60 /mti/root/msqlora-n.exe NOTmy.pwf";
I need a regex which returns the *.pwf path if the option is -sn, -n, -s, -s -n, or when neither -s nor -n is present.
The path must have the .pwf file extension only, not NOTpwf or any other extension, and the code should work for all cases except the last two, because they are invalid commands.
Note: I already asked this type of question but didn't get anything working as per my requirement. (How to get specific substring with option vale using java)
You can use:
path = s.replaceFirst(".*\\s-s?n\\s*(.+?)(?:\\s-.*|$)", "$1");
//=> D:\workdir\PV 81\config\sum81pv.pwf
Try this
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
int start = s.indexOf("-sn");
int end = s.indexOf("-C");
// skip "-sn " (4 chars) and stop before the space that precedes "-C"
System.out.println(s.substring(start + 4, end - 1));
You can also use : [A-Z]:.*\.\w+
Rather than using complex regexps for replacing, I'd rather suggest a simpler one for matching:
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
Pattern pattern = Pattern.compile("\\s-s?n\\s*(.*?)\\s*-C\\s+\\d+$");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
// => D:\workdir\PV 81\config\sum81pv.pwf
If the -C <NUMBER> is optional at the end, wrap with an optional group -> (?:\\s*-C\\s+\\d+)?$.
Pattern details:
\\s - a whitespace
-s?n - a -sn or -n (as s? matches an optional s)
\\s* - 0+ whitespaces
(.*?) - Group 1 matching any 0+ chars other than a newline, as few as possible
\\s* - 0+ whitespaces
-C - a literal -C
\\s+ - 1+ whitespaces
\\d+ - 1 or more digits
$ - end of string.
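The note above about making -C <NUMBER> optional can be sketched as follows. This is only an illustration: the class name PathExtractor and the method extract are made up for the example, and it covers just the -sn / -n cases from the original string, not the later s1..s13 variants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathExtractor {
    // Same pattern as in the answer, with the trailing "-C <NUMBER>" wrapped
    // in an optional non-capturing group.
    private static final Pattern PATH =
            Pattern.compile("\\s-s?n\\s*(.*?)(?:\\s*-C\\s+\\d+)?$");

    // Hypothetical helper: returns the text after -sn / -n, or null if no match.
    static String extract(String command) {
        Matcher m = PATH.matcher(command);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String withC = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
        String withoutC = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf";
        System.out.println(extract(withC));
        System.out.println(extract(withoutC));
    }
}
```

Because Group 1 is lazy, it stops as soon as the optional -C group and the end of the string can match, so both calls should yield the same path.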

java.lang.ClassCastException: com.itextpdf.text.pdf.PdfNull cannot be cast to com.itextpdf.text.pdf.PdfDictionary

I get the exception below while trying to read a byte array with iText's PdfReader.
My code is below. I am able to open this file in Acrobat Reader.
PdfReader reader = new PdfReader(bFile);
Exception:
java.lang.ClassCastException: com.itextpdf.text.pdf.PdfNull cannot be cast to com.itextpdf.text.pdf.PdfDictionary
at com.itextpdf.text.pdf.PdfReader$PageRefs.iteratePages(PdfReader.java:3712)
at com.itextpdf.text.pdf.PdfReader$PageRefs.iteratePages(PdfReader.java:3743)
at com.itextpdf.text.pdf.PdfReader$PageRefs.readPages(PdfReader.java:3548)
at com.itextpdf.text.pdf.PdfReader$PageRefs.<init>(PdfReader.java:3518)
at com.itextpdf.text.pdf.PdfReader$PageRefs.<init>(PdfReader.java:3496)
at com.itextpdf.text.pdf.PdfReader.readPages(PdfReader.java:1142)
at com.itextpdf.text.pdf.PdfReader.readPdf(PdfReader.java:659)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:176)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:244)
at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:234)
I'm using iText 5.4.4 and couldn't find much detail by googling. It looks like the PDF has some issue, but I can't work out what it is. Below is an excerpt from the PDF:
%PDF-1.5
%âãÏÓ
1 0 obj
<<
/Type /Catalog
/Lang (en-US)
/StructTreeRoot 39 0 R
/MarkInfo <<
/Marked true
>>
/Pages 187 0 R
/AcroForm 350 0 R
/OCProperties 2131 0 R
/Outlines 2531 0 R
/OpenAction <<
/Type /Action
/S /GoTo
/D [ 3 0 R /XYZ 0 792 0 ]
>>
/ViewerPreferences <<
/HideToolbar false
/HideMenubar false
/HideWindowUI false
/FitWindow false
/CenterWindow false
>>
UPDATE: After debugging I found that /Pages 187 0 R is the problem. If I change it to /Pages 2 0 R, it works. Could someone please explain what /Pages refers to?

Shuffle multiple files in same order

Setup:
I have 50 files, each with 25000 lines.
To-do:
I need to shuffle all of them "in the same order".
E.g.:
If before shuffle:
File 1 File 2 File 3
A A A
B B B
C C C
then after shuffle I should get:
File 1 File 2 File 3
B B B
C C C
A A A
i.e. corresponding rows in files should be shuffled in same order.
Also, the shuffle should be deterministic, i.e. if I give File A as input, it should always produce the same shuffled output.
I can write a Java program to do it, or probably a script too. Something like: shuffle the numbers between 1 and 25000 and store the result in a file, say shuffle_order. Then simply process one file at a time and order the existing rows according to shuffle_order. But is there a better/quicker way to do this?
Please let me know if more info is needed.
The following uses only basic bash commands. The principle is:
generate a random order (numbers)
order all files in this order
The code:
#!/bin/bash
case "$#" in
0) echo "Usage: $0 files....." ; exit 1;;
esac
ORDER="./.rand.$$"
trap "rm -f $ORDER;exit" 1 2
count=$(grep -c '^' "$1")
let odcount=$(($count * 4))
paste -d" " <(od -A n -N $odcount -t u4 /dev/urandom | grep -o '[0-9]*') <(seq -w $count) |\
sort -k1n | cut -d " " -f2 > $ORDER
#if your system has the "shuf" command you can replace the above 3 lines with a simple
#seq -w $count | shuf > $ORDER
for file in "$@"
do
    paste -d' ' $ORDER $file | sort -k1n | cut -d' ' -f2- > "$file.rand"
done
echo "the order is in the file $ORDER" # remove this line
#rm -f $ORDER # and uncomment this
# if dont need preserve the order
paste -d " " *.rand #remove this line - it is only for showing test result
from the input files:
A B C
--------
a1 a2 a3
b1 b2 b3
c1 c2 c3
d1 d2 d3
e1 e2 e3
f1 f2 f3
g1 g2 g3
h1 h2 h3
i1 i2 i3
j1 j2 j3
will make A.rand B.rand C.rand with the following example content
g1 g2 g3
e1 e2 e3
b1 b2 b3
c1 c2 c3
f1 f2 f3
j1 j2 j3
d1 d2 d3
h1 h2 h3
i1 i2 i3
a1 a2 a3
Real testing: generating 50 files with 25k lines each
line="Consequatur qui et qui. Mollitia expedita aut excepturi modi. Enim nihil et laboriosam sit a tenetur."
for n in $(seq -w 50)
do
seq -f "$line %g" 25000 >file.$n
done
running the script
bash sorter.sh file.??
result on my notebook
real 1m13.404s
user 0m56.127s
sys 0m5.143s
Probably very inefficient but try below:
#!/bin/bash
arr=( $(for i in {1..25000}; do
    echo "$i"
done | shuf) )
for file in files*; do
    index=0
    new=$(while read line; do
        echo "${arr[$index]} $line"
        (( index++ ))
    done < "$file" | sort -h | sed 's/^[0-9]\+ //')
    echo "$new" > "$file"
done
I propose to shuffle them with a python script. By setting the same seed for every shuffling, you will obtain the same final data order.
import argparse
import logging
import os
import random

from tqdm import tqdm

logging.getLogger().setLevel(logging.INFO)


def main(args):
    assert os.path.isfile(args.input_file), (
        f"filename {args.input_file} does not exist"
    )
    logging.info("Reading input file...")
    with open(args.input_file) as fi:
        data = fi.readlines()
    logging.info("Generating indexes")
    indexes = list(range(len(data)))
    logging.info("Shuffling...")
    random.seed(args.seed)
    random.shuffle(indexes)
    logging.info(f"Writing results, in place? {args.in_place}")
    if not args.in_place:
        name, ext = os.path.splitext(args.input_file)
        new_filename = name + "_shuffled" + ext
        args.input_file = new_filename
    with open(args.input_file, "w") as fo:
        for index in tqdm(indexes, desc="Writing to output file..."):
            fo.write(data[index])
        fo.flush()
        os.fsync(fo)
    logging.info("Done!")


if __name__ == '__main__':
    parser = argparse.ArgumentParser("Shuffle file by lines.")
    parser.add_argument('--input_file', type=str, required=True, help="Input file to be shuffled")
    parser.add_argument('--in_place', action="store_true", help="Whether to shuffle file in-place.")
    parser.add_argument('--seed', type=int, required=True, help="Seed with which the file will be shuffled.")
    args = parser.parse_args()
    main(args)
You can run this script once per file, with the same seed each time:
python shuffle.py --input_file File1 --seed 123
python shuffle.py --input_file File2 --seed 123
python shuffle.py --input_file File3 --seed 123
As long as the files have the same number of lines, they will all be shuffled in the same way.
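Since the question mentions writing a Java program, here is a minimal sketch of the same fixed-seed idea in Java. The class and method names (SameOrderShuffle, permutation, apply) are invented for illustration, and the files are represented as in-memory lists; reading and writing the 50 real files is left out and would need to be added around it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SameOrderShuffle {
    // Builds one permutation of 0..n-1. A fixed seed makes it reproducible
    // across runs, which gives the deterministic behaviour asked for.
    static List<Integer> permutation(int n, long seed) {
        List<Integer> order = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));
        return order;
    }

    // Reorders the lines so that output row i is input row order.get(i).
    static List<String> apply(List<String> lines, List<Integer> order) {
        List<String> out = new ArrayList<>(lines.size());
        for (int idx : order) {
            out.add(lines.get(idx));
        }
        return out;
    }

    public static void main(String[] args) {
        // The seed value 123 is arbitrary; any fixed value works.
        List<Integer> order = permutation(3, 123L);
        // The same permutation is applied to every "file", so rows stay aligned.
        System.out.println(apply(Arrays.asList("A1", "B1", "C1"), order));
        System.out.println(apply(Arrays.asList("A2", "B2", "C2"), order));
    }
}
```

The permutation is generated once and reused, so there is no need to store a shuffle_order file unless the order must survive a change of Java version or library.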

Why doesn't ANTLR recognise this rule the way I expect?

I'm using ANTLR to replace an existing (small) parser I currently have. Here is a snippet of the file I am trying to parse:
Lurker 915236167 10 2 Bk cc b b 1000 70 200 Jc Qs
Lurker 915236237 10 1 Bc kf - - 1130 10 0
Lurker 915236302 10 10 c c rc b 1120 110 305 6d Kd
Lurker 915236381 10 9 c f - - 1315 20 0
Lurker 915236425 10 8 cc f - - 1295 30 0
Here is Shared.g:
lexer grammar Shared;
NICK
: LETTER (LETTER | NUMBER | SPECIAL)*
;
fragment
LETTER
: 'A'..'Z'
| 'a'..'z'
| '_'
;
NUMBER
: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')+
;
fragment
SPECIAL
: ('-'|'^'|'{'|'}'|'|'|'['|']'|'`'|'\\')
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
)+
;
And Pdb.g:
grammar Pdb;
import Shared;
@header{
import java.util.ArrayList;
import java.sql.Connection;
}
@members{
private Connection conn;
private StringBuilder currentExpr = new StringBuilder(500);
ArrayList<String> players = new ArrayList<String>(10);
public void setConn(Connection conn){
this.conn = conn;
}
}
pdb
: line+
;
line
@after{
currentExpr.append("execute player_handplan(");
currentExpr.append($nick.text);
currentExpr.append(", to_timestamp(");
currentExpr.append(Integer.parseInt($timestamp.text));
currentExpr.append("), ");
currentExpr.append(Integer.parseInt($n_players.text));
currentExpr.append(", ");
currentExpr.append(Integer.parseInt($position.text));
currentExpr.append(", ");
currentExpr.append($action_p.text);
currentExpr.append(", ");
currentExpr.append($action_f.text);
currentExpr.append(", ");
currentExpr.append($action_t.text);
currentExpr.append(", ");
currentExpr.append($action_r.text);
currentExpr.append(", ");
currentExpr.append(Integer.parseInt($bankroll.text));
currentExpr.append(", ");
currentExpr.append(Integer.parseInt($total_action.text));
currentExpr.append(", ");
currentExpr.append(Integer.parseInt($amount_won.text));
currentExpr.append(", ");
currentExpr.append("CARDS");
currentExpr.append(");");
System.out.println(currentExpr.toString());
currentExpr = new StringBuilder(500);
}
: nick=NICK WS
timestamp=NUMBER WS
n_players=NUMBER WS
position=NUMBER WS
action_p=action WS
action_f=action WS
action_t=action WS
action_r=action WS
bankroll=NUMBER WS
total_action=NUMBER WS
amount_won=NUMBER WS
(NICK WS NICK WS)? // ignore this
;
action
: '-'
| ('B'|'f'|'k'|'b'|'c'|'r'|'A'|'Q'|'K')+
;
My problem is, when I run the parser, I get the following error:
cal@lambda:~/src/DecisionTrees/grammar/output$ cat example | java Test
line 1:26 no viable alternative at input 'Bk'
line 1:30 no viable alternative at input 'cc'
execute player_handplan(Lurker, to_timestamp(915236167), 10, 2, null, null, b, b, 1000, 70, 200, CARDS);
Why won't my grammar accept "Bk", even though it will accept "b"? I feel like there is something obvious I am overlooking. Thanks in advance
Why don't you use {$channel=HIDDEN} in the WS rule and leave WS out of the line rule?
That way at least you won't get in trouble for putting one too many WS by accident.
And if action can only have 2 chars max, maybe trying this will help:
action
: '-'
| ('B'|'f'|'k'|'b'|'c'|'r'|'A'|'Q'|'K')('B'|'f'|'k'|'b'|'c'|'r'|'A'|'Q'|'K')?
;
