Shuffle multiple files in same order - java

Setup:
I have 50 files, each with 25000 lines.
To-do:
I need to shuffle all of them "in the same order".
E.g.:
If before shuffle:
File 1 File 2 File 3
A A A
B B B
C C C
then after shuffle I should get:
File 1 File 2 File 3
B B B
C C C
A A A
i.e. corresponding rows in files should be shuffled in same order.
Also, the shuffle should be deterministic, i.e. if I give File A as input, it should always produce same shuffled output.
I can write a Java program to do it, or probably a script too. Something like: shuffle the numbers between 1 and 25000 and store them in a file, say shuffle_order. Then simply process one file at a time and reorder its rows according to shuffle_order. But is there a better/quicker way to do this?
Please let me know if more info needed.

The following uses only basic bash commands. The principle is:
generate a random order (numbers)
order all files in this order
the code
#!/bin/bash
case "$#" in
0) echo "Usage: $0 files....." ; exit 1;;
esac
ORDER="./.rand.$$"
trap "rm -f $ORDER;exit" 1 2
count=$(grep -c '^' "$1")
let odcount=$(($count * 4))
paste -d" " <(od -A n -N $odcount -t u4 /dev/urandom | grep -o '[0-9]*') <(seq -w $count) |\
sort -k1n | cut -d " " -f2 > $ORDER
#if your system has the "shuf" command you can replace the above 3 lines with a simple
#seq -w $count | shuf > $ORDER
for file in "$@"
do
paste -d' ' $ORDER $file | sort -k1n | cut -d' ' -f2- > "$file.rand"
done
echo "the order is in the file $ORDER" # remove this line
#rm -f $ORDER # and uncomment this
# if you don't need to preserve the order
paste -d " " *.rand #remove this line - it is only for showing test result
from the input files:
A B C
--------
a1 a2 a3
b1 b2 b3
c1 c2 c3
d1 d2 d3
e1 e2 e3
f1 f2 f3
g1 g2 g3
h1 h2 h3
i1 i2 i3
j1 j2 j3
will make A.rand B.rand C.rand with content like the following example
g1 g2 g3
e1 e2 e3
b1 b2 b3
c1 c2 c3
f1 f2 f3
j1 j2 j3
d1 d2 d3
h1 h2 h3
i1 i2 i3
a1 a2 a3
real testing - generating 50 files with 25k lines
line="Consequatur qui et qui. Mollitia expedita aut excepturi modi. Enim nihil et laboriosam sit a tenetur."
for n in $(seq -w 50)
do
seq -f "$line %g" 25000 >file.$n
done
running the script
bash sorter.sh file.??
result on my notebook
real 1m13.404s
user 0m56.127s
sys 0m5.143s

Probably very inefficient but try below:
#!/bin/bash
arr=( $(for i in {1..25000}; do
echo "$i"
done | shuf) )
for file in files*; do
index=0
new=$(while read line; do
echo "${arr[$index]} $line"
(( index++ ))
done < "$file" | sort -h | sed 's/^[0-9]\+ //')
echo "$new" > "$file"
done

I propose shuffling them with a Python script. If you set the same seed for every shuffle, every file ends up in the same final order (provided the files have the same number of lines).
import argparse
import logging
import os
import random
from tqdm import tqdm
logging.getLogger().setLevel(logging.INFO)
def main(args):
    assert os.path.isfile(args.input_file), (
        f"filename {args.input_file} does not exist"
    )
    logging.info("Reading input file...")
    with open(args.input_file) as fi:
        data = fi.readlines()
    logging.info("Generating indexes")
    indexes = list(range(len(data)))
    logging.info("Shuffling...")
    random.seed(args.seed)
    random.shuffle(indexes)
    logging.info(f"Writing results, in place? {args.in_place}")
    if not args.in_place:
        name, ext = os.path.splitext(args.input_file)
        new_filename = name + "_shuffled" + ext
        args.input_file = new_filename
    with open(args.input_file, "w") as fo:
        for index in tqdm(indexes, desc="Writing to output file..."):
            fo.write(data[index])
        fo.flush()
        os.fsync(fo)
    logging.info("Done!")
if __name__ == '__main__':
    parser = argparse.ArgumentParser("Shuffle file by lines.")
    parser.add_argument('--input_file', type=str, required=True, help="Input file to be shuffled")
    parser.add_argument('--in_place', action="store_true", help="Whether to shuffle file in-place.")
    parser.add_argument('--seed', type=int, required=True, help="Seed with which the file will be shuffled.")
    args = parser.parse_args()
    main(args)
You can run this script with:
python shuffle.py --input_file File1 --seed 123
python shuffle.py --input_file File2 --seed 123
python shuffle.py --input_file File3 --seed 123
And all the files will be shuffled in the same way.
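Since the question asks about Java, the same seeded-permutation idea can be sketched there as well. This is only a minimal sketch under assumptions: the class name SameOrderShuffle, the fixed seed 123 and the ".rand" output suffix are illustrative choices, and every file is assumed to fit in memory and to have the same number of lines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper class; the name is not from the original posts.
final class SameOrderShuffle {

    /** Builds a permutation of 0..n-1 that depends only on n and the seed. */
    static List<Integer> permutation(int n, long seed) {
        List<Integer> order = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));
        return order;
    }

    /** Returns the lines rearranged so that output row i is input row order[i]. */
    static <T> List<T> apply(List<T> lines, List<Integer> order) {
        List<T> out = new ArrayList<>(lines.size());
        for (int index : order) {
            out.add(lines.get(index));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        long seed = 123L; // assumed fixed seed, so reruns give the same order
        for (String name : args) {
            Path in = Paths.get(name);
            List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
            List<String> shuffled = apply(lines, permutation(lines.size(), seed));
            // Write next to the input instead of overwriting it.
            Files.write(Paths.get(name + ".rand"), shuffled, StandardCharsets.UTF_8);
        }
    }
}
```

Because the permutation depends only on the line count and the seed, running it over all 50 files (e.g. java SameOrderShuffle file.*) moves corresponding rows to the same positions in every output file.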

SearchRequest in RootDSE

I have the following function to query users from an AD server:
public List<LDAPUserDTO> getUsersWithPaging(String filter)
{
    List<LDAPUserDTO> userList = new ArrayList<>();
    try (LDAPConnection connection = new LDAPConnection(config.getHost(), config.getPort(), config.getUsername(), config.getPassword()))
    {
        SearchRequest searchRequest = new SearchRequest("", SearchScope.SUB, filter, null);
        ASN1OctetString resumeCookie = null;
        while (true)
        {
            searchRequest.setControls(
                new SimplePagedResultsControl(100, resumeCookie));
            SearchResult searchResult = connection.search(searchRequest);
            for (SearchResultEntry e : searchResult.getSearchEntries())
            {
                LDAPUserDTO tmp = new LDAPUserDTO();
                tmp.distinguishedName = e.getAttributeValue("distinguishedName");
                tmp.name = e.getAttributeValue("name");
                userList.add(tmp);
            }
            LDAPTestUtils.assertHasControl(searchResult,
                SimplePagedResultsControl.PAGED_RESULTS_OID);
            SimplePagedResultsControl responseControl =
                SimplePagedResultsControl.get(searchResult);
            if (responseControl.moreResultsToReturn())
            {
                resumeCookie = responseControl.getCookie();
            }
            else
            {
                break;
            }
        }
        return userList;
    } catch (LDAPException e) {
        logger.error(e.getExceptionMessage());
        return null;
    }
}
However, this breaks when I try to search on the RootDSE.
What I've tried so far:
baseDN = null
baseDN = "";
baseDN = RootDSE.getRootDSE(connection).getDN()
baseDN = "RootDSE"
All resulting in various exceptions or empty results:
Caused by: LDAPSDKUsageException(message='A null object was provided where a non-null object is required (non-null index 0).
2020-04-01 10:42:22,902 ERROR [de.dbz.service.LDAPService] (default task-1272) LDAPException(resultCode=32 (no such object), numEntries=0, numReferences=0, diagnosticMessage='0000208D: NameErr: DSID-03100213, problem 2001 (NO_OBJECT), data 0, best match of:
''
', ldapSDKVersion=4.0.12, revision=aaefc59e0e6d110bf3a8e8a029adb776f6d2ce28')
So, I really spent a lot of time on this. It is possible to (kind of) query the RootDSE, but it's not as straightforward as one might think.
I mainly used Wireshark to see what the guys at Softerra are doing with their LDAP Browser.
Turns out I wasn't that far away:
In the captured search request, the baseObject is empty.
Also, there is one additional control with the OID LDAP_SERVER_SEARCH_OPTIONS_OID and the ASN.1 value 308400000003020102.
So what does this 308400000003020102 (more readable: 30 84 00 00 00 03 02 01 02) actually do?
First of all, we decode it into something we can read; in this case, that is the int 2.
In binary, this gives us: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
As we know from the documentation, we have the following notation:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|-------|-------|
| x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | SSFPR | SSFDS |
or we just take the int values from the documentation:
1 = SSFDS -> SERVER_SEARCH_FLAG_DOMAIN_SCOPE
2 = SSFPR -> SERVER_SEARCH_FLAG_PHANTOM_ROOT
So, in my example, we have SSFPR which is defined as follows:
For AD DS, instructs the server to search all NC replicas except
application NC replicas that are subordinate to the search base, even
if the search base is not instantiated on the server. For AD LDS, the
behavior is the same except that it also includes application NC
replicas in the search. For AD DS and AD LDS, this will cause the
search to be executed over all NC replicas (except for application NCs
on AD DS DCs) held on the DC that are subordinate to the search base.
This enables search bases such as the empty string, which would cause
the server to search all of the NC replicas (except for application
NCs on AD DS DCs) that it holds.
NC stands for naming context; the naming contexts are stored in the RootDSE as the operational attribute namingContexts.
The other value, SSFDS does the following:
Prevents continuation references from being generated when the search
results are returned. This performs the same function as the
LDAP_SERVER_DOMAIN_SCOPE_OID control.
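To double-check the decoding of the control value described above, here is a minimal BER reader in plain Java. It is only a sketch under assumptions: the class and method names are made up for illustration, it handles just the SEQUENCE { INTEGER } shape seen in the capture, and it assumes a small non-negative integer. The flag constants are the int values quoted from the documentation.

```java
// Hypothetical helper class; not part of any LDAP SDK.
final class SearchOptionsDecoder {

    static final int SSFDS = 1; // SERVER_SEARCH_FLAG_DOMAIN_SCOPE
    static final int SSFPR = 2; // SERVER_SEARCH_FLAG_PHANTOM_ROOT

    /** Converts a hex string such as "3084..." into raw bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Reads a BER length (short or long form) at pos[0] and advances pos[0]. */
    static int readLength(byte[] ber, int[] pos) {
        int first = ber[pos[0]++] & 0xFF;
        if (first < 0x80) {
            return first;           // short form: the byte itself is the length
        }
        int count = first & 0x7F;   // long form: number of length bytes that follow
        int length = 0;
        for (int i = 0; i < count; i++) {
            length = (length << 8) | (ber[pos[0]++] & 0xFF);
        }
        return length;
    }

    /** Decodes SEQUENCE { INTEGER flags } and returns the flags value. */
    static int decodeFlags(byte[] ber) {
        int[] pos = {0};
        if ((ber[pos[0]++] & 0xFF) != 0x30) {
            throw new IllegalArgumentException("expected SEQUENCE tag 0x30");
        }
        readLength(ber, pos);       // sequence length (0x84 00 00 00 03 in the capture)
        if ((ber[pos[0]++] & 0xFF) != 0x02) {
            throw new IllegalArgumentException("expected INTEGER tag 0x02");
        }
        int intLength = readLength(ber, pos);
        int value = 0;
        for (int i = 0; i < intLength; i++) {
            value = (value << 8) | (ber[pos[0]++] & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        int flags = decodeFlags(fromHex("308400000003020102"));
        System.out.println("flags = " + flags);                      // flags = 2
        System.out.println("SSFDS set: " + ((flags & SSFDS) != 0));  // false
        System.out.println("SSFPR set: " + ((flags & SSFPR) != 0));  // true
    }
}
```

Read byte by byte: 30 is the SEQUENCE tag, 84 00 00 00 03 is a long-form length of 3, and 02 01 02 is an INTEGER of length 1 with value 2, i.e. only SSFPR is set.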
So, someone might ask why I even do this. As it turns out, I have a customer with several sub-DCs under one DC. If I tell the search to follow referrals, the execution time is far too long, so that wasn't really an option for me. But with referrals turned off, I wasn't getting all the results when I set the BaseDN to the group whose members I wanted to retrieve.
Searching via the RootDSE option in Softerra's LDAP Browser was way faster and returned the results in less than one second.
I personally don't have any clue why this is so much faster, but Active Directory without any interface or tool from Microsoft is kind of black magic to me anyway. To be frank, that's not really my area of expertise.
In the end, I ended up with the following Java code:
SearchRequest searchRequest = new SearchRequest("", SearchScope.SUB, filter, null);
[...]
Control globalSearch = new Control("1.2.840.113556.1.4.1340", true, new ASN1OctetString(Hex.decode("308400000003020102")));
searchRequest.setControls(new SimplePagedResultsControl(100, resumeCookie, true),globalSearch);
[...]
The Hex.decode() used here is org.bouncycastle.util.encoders.Hex.
A huge thanks to the guys at Softerra, who more or less put an end to my journey into the abyss of AD.
You can't query users from the RootDSE.
Use either a domain or, if you need to query users across domains in a forest, the global catalog (which runs on different ports, 3268 / 3269, not the default 389 / 636 for LDAP(S)).
RootDSE only contains metadata. Probably this question should be asked elsewhere for more information but first read up on the documentation from Microsoft, e.g.:
https://learn.microsoft.com/en-us/windows/win32/ad/where-to-search
https://learn.microsoft.com/en-us/windows/win32/adschema/rootdse
E.g.: namingContexts attribute can be read to find which other contexts you may want to query for actual users.
Maybe start with this nice article as introduction:
http://cbtgeeks.com/2016/06/02/what-is-rootdse/

Spark: Splitting using delimiter doesn't work with commas

I am working with Spark SQL (Spark 2.2) and using the Java API to load data from a CSV file.
The CSV file has quotes inside cells, and the column separator is a pipe |.
Line example: 2012|"Hello|World"
This is my code for reading the CSV and returning a Dataset:
SparkSession session = SparkSession.builder().getOrCreate();
Dataset<Row> dataset = session.read().option("header", "true").option("delimiter", "|").csv(filePath);
This is what I got
+-----+--------------+--------------------------+
|Year | c1           | c2                       |
+-----+--------------+--------------------------+
|2012 |Hello|World   | null                     |
+-----+--------------+--------------------------+
The expected result is this:
+-----+--------------+--------------------------+
|Year | c1           | c2                       |
+-----+--------------+--------------------------+
|2012 |"Hello        | World"                   |
+-----+--------------+--------------------------+
The only thing I can think of is deleting the quotes ("), but this is out of the question because I don't want to change the values of the cells.
I would appreciate any ideas, thanks.
Try this :
Dataset<Row> test = spark.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", "|")
.option("quote", " ")
.load(filePath);
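The idea behind the suggestion above seems to be that, once the quote option is set to a character other than ", the double quote is treated as ordinary data and only the pipe separates columns. The following plain-Java illustration shows what that splitting looks like for the sample line. It is a sketch only: it does not use Spark or its CSV parser, and the class name NaivePipeSplit is made up.

```java
import java.util.Arrays;

// Hypothetical demo class; mimics a quote-unaware split on '|'.
final class NaivePipeSplit {

    /** Splits on every pipe, treating double quotes as ordinary characters. */
    static String[] split(String line) {
        // The limit -1 keeps trailing empty columns instead of dropping them.
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        String line = "2012|\"Hello|World\"";
        System.out.println(Arrays.toString(split(line)));
        // prints [2012, "Hello, World"]
    }
}
```

That matches the expected table in the question: the quotes stay in the cell values, and the pipe inside them still splits the row into c1 and c2.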

Finding substring from a string using regex java

I have a String:
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
I want to get the path (in this case D:\\workdir\\PV 81\\config\\sum81pv.pwf) from this string. This path is an argument of a command option -sn or -n, so this path always appears after these options.
The path may or may not contain whitespaces, which needs to be handled.
public class TestClass {
public static void main(String[] args) {
String path;
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
path = s.replaceAll(".*(-sn|-n) \"?([^ ]*)?", "$2");
System.out.println("Path: " + path);
}
}
Current output: Path: D:\workdir\PV 81\config\sum81pv.pwf -C 5000
Expected output: Path: D:\workdir\PV 81\config\sum81pv.pwf
The answers below work fine for the earlier case.
But given the cases below, what regex would find the password file?
String s1 = "msqllab91 0 0 1 50 50 60 /mti/root/bin/msqlora -n \"tmp/my.pwf\" -s";
String s2 = "msqllab92 0 0 1 50 50 60 /mti/root/bin/msqlora -s -n /mti/root/my.pwf";
String s3 = "msqllab93 0 0 1 50 50 60 msqlora -s -n \"/mti/root/my.pwf\" -C 10000";
String s4 = "msqllab94 0 0 1 50 50 60 msqlora.exe -sn /mti/root/my.pwf";
String s5 = "msqllab95 0 0 1 50 50 60 msqlora.exe -sn \"/mti/root\"/my.pwf";
String s6 = "msqllab96 0 0 1 50 50 60 msqlora.exe -sn\"/mti/root\"/my.pwf";
String s7 = "msqllab97 0 0 1 50 50 60 \"/mti/root/bin/msqlora\" -s -n /mti/root/my.pwf -s";
String s8 = "msqllab98 0 0 1 50 50 60 /mti/root/bin/msqlora -s";
String s9 = "msqllab99 0 0 1 50 50 60 /mti/root/bin/msqlora -s -n /mti/root/my.NOTpwf -s -n /mti/root/my.pwf";
String s10 = "msqllab90 0 0 1 50 50 60 /mti/root/bin/msqlora -sn /mti/root/my.NOTpwf -sn /mti/root/my.pwf";
String s11 = "msqllab901 0 0 1 50 50 60 /mti/root/bin/msqlora";
String s12 = "msqllab902 0 0 1 50 50 60 /mti/root/msqlora-n NOTmy.pwf";
String s13 = "msqllab903 0 0 1 50 50 60 /mti/root/msqlora-n.exe NOTmy.pwf";
I need a regex which returns the *.pwf path if the option is -sn, -n, -s, -s -n, or without -s or -n.
The path must have the *.pwf file extension only, not NOTpwf or any other extension, and the code should work for all cases except the last two, because those are invalid commands.
Note: I already asked this type of question but didn't get anything working as per my requirement. (How to get specific substring with option vale using java)
You can use:
path = s.replaceFirst(".*\\s-s?n\\s*(.+?)(?:\\s-.*|$)", "$1");
//=> D:\workdir\PV 81\config\sum81pv.pwf
Try this
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
int l = s.indexOf("-sn");
int l1 = s.indexOf("-C");
// skip "-sn " (4 chars); trim() drops the space before "-C"
System.out.println(s.substring(l + 4, l1).trim());
You can also use : [A-Z]:.*\.\w+
Rather than using complex regexps for replacing, I'd rather suggest a simpler one for matching:
String s = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
Pattern pattern = Pattern.compile("\\s-s?n\\s*(.*?)\\s*-C\\s+\\d+$");
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
System.out.println(matcher.group(1));
}
// => D:\workdir\PV 81\config\sum81pv.pwf
If the -C <NUMBER> is optional at the end, wrap with an optional group -> (?:\\s*-C\\s+\\d+)?$.
Pattern details:
\\s - a whitespace
-s?n - a -sn or -n (as s? matches an optional s)
\\s* - 0+ whitespaces
(.*?) - Group 1 matching any 0+ chars other than a newline
\\s* - 0+ whitespaces (same as above)
-C - a literal -C
\\s+ - 1+ whitespaces
\\d+ - 1 or more digits
$ - end of string.
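Putting the pattern details together with the optional -C group mentioned above, a small self-contained sketch could look like this. The class and method names are illustrative assumptions, and it only targets the first example in the question (an unquoted path after -sn or -n), not the later quoted and multi-option cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are not from the original answer.
final class PathAfterOption {

    // Same pattern as in the answer, with the trailing "-C <NUMBER>" made optional.
    private static final Pattern PATH =
            Pattern.compile("\\s-s?n\\s*(.*?)(?:\\s*-C\\s+\\d+)?$");

    /** Returns the text after -sn / -n, or null when neither option is present. */
    static String extractPath(String command) {
        Matcher matcher = PATH.matcher(command);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String withC = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf -C 5000";
        String withoutC = "msqlsum81pv 0 0 25 25 25 2 -sn D:\\workdir\\PV 81\\config\\sum81pv.pwf";
        System.out.println(extractPath(withC));    // D:\workdir\PV 81\config\sum81pv.pwf
        System.out.println(extractPath(withoutC)); // D:\workdir\PV 81\config\sum81pv.pwf
    }
}
```

The lazy group (.*?) keeps growing until either the optional -C part or the end of the string can match, which is why the space inside "PV 81" does not cut the path short.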

Transferring data structure from R to Java

I have an R script that does some computation. The last step of the computation is a kernel density estimate: http://www.inside-r.org/packages/cran/kerdiest/docs/kde
I now, in R, need to convert the result of calling kde into a string, or save it into a file, such that I can read and "unmarshal" it from a Java program.
What is the best format to use for the exchange and what R and Java libraries can read / write that format?
The structure is not ridiculously complex, but also not trivial:
> str(tmp)
List of 8
$ x : num [1:1398, 1:3] 1.035 0.902 0.679 0.826 1.243 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "Rb ppm" "Sb ppm" "Cr ppm"
$ eval.points:'data.frame': 1398 obs. of 3 variables:
..$ Rb ppm: num [1:1398] 1.035 0.902 0.679 0.826 1.243 ...
..$ Sb ppm: num [1:1398] -2.58 -2.6 -2.48 -2.44 -2.53 ...
..$ Cr ppm: num [1:1398] 4.56 4.44 4.3 4.26 4.49 ...
$ estimate : Named num [1:1398] 0.1572 0.0897 0.0311 0.0434 0.099 ...
..- attr(*, "names")= chr [1:1398] "1" "2" "3" "4" ...
$ H : num [1:3, 1:3] 0.02395 0.00927 -0.014 0.00927 0.06868 ...
$ gridded : logi FALSE
$ binned : logi FALSE
$ names : chr [1:3] "Rb ppm" "Sb ppm" "Cr ppm"
$ w : num [1:1398] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "class")= chr "kde"
RJSONIO seems to do the job. It seems quite verbose, however.

Why doesn't ANTLR recognise this rule the way I expect?

I'm using ANTLR to replace an existing (small) parser I currently have. Here is a snippet of the file I am trying to parse:
Lurker 915236167 10 2 Bk cc b b 1000 70 200 Jc Qs
Lurker 915236237 10 1 Bc kf - - 1130 10 0
Lurker 915236302 10 10 c c rc b 1120 110 305 6d Kd
Lurker 915236381 10 9 c f - - 1315 20 0
Lurker 915236425 10 8 cc f - - 1295 30 0
Here is Shared.g:
lexer grammar Shared;
NICK
: LETTER (LETTER | NUMBER | SPECIAL)*
;
fragment
LETTER
: 'A'..'Z'
| 'a'..'z'
| '_'
;
NUMBER
: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')+
;
fragment
SPECIAL
: ('-'|'^'|'{'|'}'|'|'|'['|']'|'`'|'\\')
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
)+
;
And Pdb.g:
grammar Pdb;
import Shared;
@header{
import java.util.ArrayList;
import java.sql.Connection;
}
@members{
private Connection conn;
private StringBuilder currentExpr = new StringBuilder(500);
ArrayList<String> players = new ArrayList<String>(10);
public void setConn(Connection conn){
this.conn = conn;
}
}
pdb
: line+
;
line
@after{
currentExpr.append("execute player_handplan(");
currentExpr.append($nick.text);
currentExpr.append(", to_timestamp(");
currentExpr.append(Integer.parseInt($timestamp.text));
currentExpr.append("), ");
currentExpr.append(Integer.parseInt($n_players.text));
currentExpr.append(", ");
currentExpr.append(Integer.parseInt($position.text));
currentExpr.append(", ");
currentExpr.append($action_p.text);
currentExpr.append(", ");
currentExpr.append($action_f.text);
currentExpr.append(", ");
currentExpr.append($action_t.text);
currentExpr.append(", ");
currentExpr.append($action_r.text);
currentExpr.append(", ");
currentExpr.append(Integer.parseInt($bankroll.text));
currentExpr.append(", ");
currentExpr.append(Integer.parseInt($total_action.text));
currentExpr.append(", ");
currentExpr.append(Integer.parseInt($amount_won.text));
currentExpr.append(", ");
currentExpr.append("CARDS");
currentExpr.append(");");
System.out.println(currentExpr.toString());
currentExpr = new StringBuilder(500);
}
: nick=NICK WS
timestamp=NUMBER WS
n_players=NUMBER WS
position=NUMBER WS
action_p=action WS
action_f=action WS
action_t=action WS
action_r=action WS
bankroll=NUMBER WS
total_action=NUMBER WS
amount_won=NUMBER WS
(NICK WS NICK WS)? // ignore this
;
action
: '-'
| ('B'|'f'|'k'|'b'|'c'|'r'|'A'|'Q'|'K')+
;
My problem is, when I run the parser, I get the following error:
cal@lambda:~/src/DecisionTrees/grammar/output$ cat example | java Test
line 1:26 no viable alternative at input 'Bk'
line 1:30 no viable alternative at input 'cc'
execute player_handplan(Lurker, to_timestamp(915236167), 10, 2, null, null, b, b, 1000, 70, 200, CARDS);
Why won't my grammar accept "Bk", even though it will accept "b"? I feel like there is something obvious I am overlooking. Thanks in advance
Why don't you use {$channel=HIDDEN} in the WS rule and leave WS out of the line rule?
That way at least you won't get in trouble for putting one too many WS by accident.
And if action can only have 2 chars max maybe trying this will help:
action
: '-'
| ('B'|'f'|'k'|'b'|'c'|'r'|'A'|'Q'|'K')('B'|'f'|'k'|'b'|'c'|'r'|'A'|'Q'|'K')?
;
