Get lines of code in src folder skipping comments and empty lines - java

Hi, I already have something to get the lines of code, but it still outputs a count that includes empty lines and comments.
git ls-files | grep "\.java$" | xargs wc -l
Can you modify this to skip comments and blank lines?
Thanks in advance

Try cloc; it lists the numbers in great detail.
You need to install cloc first, for example with brew install cloc
cloc $(git ls-files)
Sample output for reference:
20 text files.
20 unique files.
6 files ignored.
http://cloc.sourceforge.net v 1.62 T=0.22 s (62.5 files/s, 2771.2 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Javascript 2 13 111 309
JSON 3 0 0 58
HTML 2 7 12 50
Handlebars 2 0 0 37
CoffeeScript 4 1 4 12
SASS 1 1 1 5
-------------------------------------------------------------------------------
SUM: 14 22 128 471
-------------------------------------------------------------------------------
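If you would rather not install anything, here is a rough Python sketch along the same lines as your pipeline; it only skips blank lines and // line comments (/* ... */ blocks are not handled), so treat the count as an approximation:
import subprocess

# List the tracked .java files, same as the git ls-files | grep pipeline
files = subprocess.run(
    ["git", "ls-files", "*.java"], capture_output=True, text=True, check=True
).stdout.splitlines()

count = 0
for path in files:
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            stripped = line.strip()
            # skip blank lines and single-line comments
            if stripped and not stripped.startswith("//"):
                count += 1
print(count)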

Most test coverage tools also calculate LOC. For instance, Jacoco spits this out to Jenkins.

Related

What will be the regular expression to parse the given output?

I want to extract the code count from the command output below. In the example below, the expected output is 286. What will be the regular expression to extract the code count?
I want to parse the following string on Windows:
1 text file.
1 unique file.
0 files ignored.
http://cloc.sourceforge.net v 1.64 T=0.01 s (86.1 files/s, 1119.4 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
C++ 1 0 0 13
-------------------------------------------------------------------------------
You can grep the trailing digits of a line with:
grep -P "\d+$"
I can parse it on Linux using the following:
grep -Eo '[0-9]+$'
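If you are doing this from a script rather than the shell, a small Python sketch of the same idea (the sample line here is just the C++ row from the output above):
import re

line = "C++                             1              0              0             13"

# the "code" column is the last run of digits on the language row
match = re.search(r"(\d+)\s*$", line)
if match:
    print(match.group(1))  # prints 13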

Creating dictionary from large Pyspark dataframe showing OutOfMemoryError: Java heap space

I have seen and tried many existing Stack Overflow posts regarding this issue, but none of them work. I guess my Java heap space is not as large as needed for my large dataset. My dataset contains 6.5M rows, and my Linux instance has 64 GB of RAM with 4 cores. As per this suggestion I need to fix my code, but I think making a dictionary from a PySpark dataframe should not be very costly. Please advise if there is any other way to compute it.
I just want to make a Python dictionary from my PySpark dataframe. This is the content of my dataframe; property_sql_df.show() gives:
+--------------+------------+--------------------+--------------------+
| id|country_code| name| hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
| BOND-9129450| US|Scotron Home w/Ga...|90cb0946cf4139e12...|
| BOND-1742850| US|Sited in the Mead...|d5c301f00e9966483...|
| BOND-3211356| US|NEW LISTING - Com...|811fa26e240d726ec...|
| BOND-7630290| US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
| BOND-7175508| US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+
What I want is to make a dictionary with hash_of_cc_pn_li as the key and the list of ids as the value.
Expected Output
{
    "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"],
    "d5c301f00e9966483": ["BOND-1742850", "BOND-7630290"]
}
What I have tried so far,
Way 1: causes java.lang.OutOfMemoryError: Java heap space
%%time
duplicate_property_list = {}
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]
Way 2: Not working because PySpark has no native OFFSET support
%%time
i = 0
limit = 1000000
for offset in range(0, total_record, limit):
    i = i + 1
    if i != 1:
        offset = offset + 1
    duplicate_property_list = {}
    duplicate_properties = {}
    # Preparing dataframe
    url = '''select id, hash_of_cc_pn_li from properties_df LIMIT {} OFFSET {}'''.format(limit, offset)
    properties_sql_df = spark.sql(url)
    # Grouping dataset
    rows = properties_sql_df.groupBy("hash_of_cc_pn_li").agg(F.collect_set("id").alias("ids")).collect()
    duplicate_property_list = {row.hash_of_cc_pn_li: row.ids for row in rows}
    # Filter the dictionary to keep only elements where the duplicate count >= 2
    duplicate_properties = filterTheDict(duplicate_property_list, lambda elem: len(elem[1]) >= 2)
    # Writing to file
    with open('duplicate_detected/duplicate_property_list_all_' + str(i) + '.json', 'w') as fp:
        json.dump(duplicate_property_list, fp)
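filterTheDict is not defined in the question; a hypothetical definition that matches how it is called in Way 2 would be:
# Hypothetical helper (not from the question): keep only the (key, value) pairs
# for which the predicate returns True.
def filterTheDict(dict_obj, callback):
    return {key: value for (key, value) in dict_obj.items() if callback((key, value))}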
What I get now on the console:
java.lang.OutOfMemoryError: Java heap space
and this error is shown in the Jupyter notebook output:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)
This is the followup question that I asked here: Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space
Why not keep as much data and processing in Executors, rather than collecting to Driver? If I understand this correctly, you could use pyspark transformations and aggregations and save directly to JSON, therefore leveraging executors, then load that JSON file (likely partitioned) back into Python as a dictionary. Admittedly, you introduce IO overhead, but this should allow you to get around your OOM heap space errors. Step-by-step:
from pyspark.sql import SparkSession  # import assumed; the original snippet uses SparkSession without showing it
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
data = [
    ("BOND-9129450", "90cb"),
    ("BOND-1742850", "d5c3"),
    ("BOND-3211356", "811f"),
    ("BOND-7630290", "d5c3"),
    ("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])
df.groupBy(
    f.col("hash_of_cc_pn_li"),
).agg(
    f.collect_set("id").alias("id")  # use f.collect_list() here if you're not interested in deduplication of BOND-XXXXX values
).write.json("./test.json")
Inspecting the output path:
ls -l ./test.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 part-00000-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 50 Jul 27 08:29 part-00039-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00043-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00159-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 _SUCCESS
Loading to Python as dict:
import json
from glob import glob
data = []
for file_name in glob('./test.json/*.json'):
    with open(file_name) as f:
        try:
            data.append(json.load(f))
        except json.JSONDecodeError:  # there is definitely a better way - this is here because some partitions might be empty
            pass
Finally
{item['hash_of_cc_pn_li']:item['id'] for item in data}
{'d5c3': ['BOND-7630290', 'BOND-1742850'],
'811f': ['BOND-3211356'],
'90cb': ['BOND-9129450', 'BOND-7175508']}
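If you also want to keep only the hashes that actually have duplicates (what your filterTheDict call was doing in Way 2), a small follow-up sketch on the dictionary built above:
result = {item['hash_of_cc_pn_li']: item['id'] for item in data}

# keep only hashes that map to two or more ids, i.e. actual duplicates
duplicates = {k: v for k, v in result.items() if len(v) >= 2}
print(duplicates)  # {'d5c3': [...], '90cb': [...]}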
I hope this helps! Thank you for the good question!

Getting test cases skipped in CHelper plugin for IntelliJ Idea

I am using CHelper plugin for IntelliJ Idea. Every time I write code and compile it, I get output like:
Test #0: SKIPPED
Test #1: SKIPPED
Test #2: SKIPPED
Test #3:
Input:
10 4
40 10 20 70 80 10 20 70 80 60
Expected output:
40
Execution result:
70
Verdict: Wrong Answer (Difference in token #0) in 0.000 s.
------------------------------------------------------------------
Test results:
Process finished with exit code 0
Here the output for Test #3 was wrong. But if the output is correct for all test cases, the following output is shown:
Test #0: Input: 5 3 10 30 40 50 20
Expected output: 30
Execution result: 30
Verdict: OK in 0.000 s.
Test #1: Input: 3 1 10 20 10
Expected output: 20
Execution result: 20
Verdict: OK in 0.001 s.
Test #2: Input: 2 100 10 10
Expected output: 0
Execution result: 0
Verdict: OK in 0.000 s.
Test #3: Input: 10 4 40 10 20 70 80 10 20 70 80 60
Expected output: 70 Execution result: 70
Verdict: OK in 0.000 s.
------------------------------------------------------------------
Test results: All tests passed in 0.001 s.
Process finished with exit code 0
I am unable to understand why the test cases are being skipped. Can anybody explain why this is happening? And if so, how can I prevent the test cases from being skipped?
When smart testing is enabled in CHelper and one of the test cases fails, the remaining test cases are skipped.
To disable smart testing:
Click on Edit > Project > Settings in the toolbar (screenshot).
In Project Settings, uncheck Use smart testing (screenshot).

Flattening JSON data into individual rows

I am interested in flattening JSON with multiple layers of nested arrays of objects. I would ideally like to do this in Java, but it seems like the pandas library in Python might be good for this.
Does anyone know a good java library for this?
I found this article (Create a Pandas DataFrame from deeply nested JSON) using pandas and jq, and my solution almost works, but the output I am receiving is not quite as expected. Here is my code sample:
import json
import pandas as pd
from sh import jq  # assuming jq is called through the sh module, as in the linked article

json_data = '''{ "id": 1,
    "things": [
        {
            "tId": 1,
            "objs": [{"this": 99}, {"this": 100}]
        },
        {
            "tId": 2,
            "objs": [{"this": 222}, {"this": 22222}]
        }
    ]
}'''
rule = """[{id: .id,
            tid: .things[].tId,
            this: .things[].objs[].this}]"""
out = jq(rule, _in=json_data).stdout
res = pd.DataFrame(json.loads(out))
The problem is the output I am receiving is this:
id this tid
0 1 99 1
1 1 100 1
2 1 222 1
3 1 22222 1
4 1 99 2
5 1 100 2
6 1 222 2
7 1 22222 2
I am expecting to see
id this tid
0 1 99 1
1 1 100 1
3 1 222 2
4 1 22222 2
Any tips on how to make this work, different solutions, or a java option would be great!
Thanks in advance!
Craig
The problem is that your "rule" creates a Cartesian product, whereas in effect you want nested iteration.
With your input, the following jq expression, which makes the nested iteration reasonably clear, produces the output as shown:
.id as $id
| .things[] as $thing
| $thing.objs[]
| [$id, .this, $thing.tId]
| @tsv
Output
1 99 1
1 100 1
1 222 2
1 22222 2
Rule
So presumably your rule should look something like this:
[{id} + (.things[] | {tid: .tId} + (.objs[] | {this}))]
or if you want to make the nested iteration clearer:
[ .id as $id
| .things[] as $thing
| $thing.objs[]
| {id: $id, this, tid: $thing.tId} ]
Running jq in Java
Besides ProcessBuilder, you might like to take a look at these wrappers:
https://github.com/eiiches/jackson-jq
https://github.com/arakelian/java-jq
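As an aside, since the question mentions pandas: the same nested flattening can be done without jq using pandas.json_normalize (a sketch assuming pandas >= 1.0; not part of the jq answer above):
import json
import pandas as pd

parsed = json.loads(json_data)
# record_path walks things -> objs; meta pulls id from the top level and tId from each thing
res = pd.json_normalize(
    parsed,
    record_path=["things", "objs"],
    meta=["id", ["things", "tId"]],
).rename(columns={"things.tId": "tid"})
# res has one row per innermost obj, with columns this, id, tid:
# (99, 1, 1), (100, 1, 1), (222, 1, 2), (22222, 1, 2)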

I got a different result when I retrained the sentiment model with Stanford CoreNLP to compare with the related paper's result

I downloaded stanford-corenlp-full-2015-12-09.
And I created a training model with the following command:
java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz
When I finished training, I found many files in my directory (screenshot: the model list).
Then I used the evaluation tool from the package and I ran the code like this:
java -cp * edu.stanford.nlp.sentiment.Evaluate -model model-0024-79.82.ser.gz -treebank test.txt
The test.txt file came from trainDevTestTrees_PTB.zip. This is the resulting output:
F:\trainDevTestTrees_PTB\trees>java -cp * edu.stanford.nlp.sentiment.Evaluate -model model-0024-79.82.ser.gz -treebank test.txt
EVALUATION SUMMARY
Tested 82600 labels
65331 correct
17269 incorrect
0.790932 accuracy
Tested 2210 roots
890 correct
1320 incorrect
0.402715 accuracy
Label confusion matrix
Guess/Gold 0 1 2 3 4 Marg. (Guess)
0 551 340 87 32 6 1016
1 956 5348 2476 686 191 9657
2 354 2812 51386 3097 467 58116
3 146 744 2525 6804 1885 12104
4 1 11 74 379 1242 1707
Marg. (Gold) 2008 9255 56548 10998 3791
0 prec=0.54232, recall=0.2744, spec=0.99423, f1=0.36442
1 prec=0.5538, recall=0.57785, spec=0.94125, f1=0.56557
2 prec=0.8842, recall=0.90871, spec=0.74167, f1=0.89629
3 prec=0.56213, recall=0.61866, spec=0.92598, f1=0.58904
4 prec=0.72759, recall=0.32762, spec=0.9941, f1=0.4518
Root label confusion matrix
Guess/Gold 0 1 2 3 4 Marg. (Guess)
0 50 60 12 9 3 134
1 161 370 147 94 36 808
2 31 103 102 60 32 328
3 36 97 123 305 265 826
4 1 3 5 42 63 114
Marg. (Gold) 279 633 389 510 399
0 prec=0.37313, recall=0.17921, spec=0.9565, f1=0.24213
1 prec=0.45792, recall=0.58452, spec=0.72226, f1=0.51353
2 prec=0.31098, recall=0.26221, spec=0.87589, f1=0.28452
3 prec=0.36925, recall=0.59804, spec=0.69353, f1=0.45659
4 prec=0.55263, recall=0.15789, spec=0.97184, f1=0.24561
Approximate Negative label accuracy: 0.638817
Approximate Positive label accuracy: 0.697140
Combined approximate label accuracy: 0.671925
Approximate Negative root label accuracy: 0.702851
Approximate Positive root label accuracy: 0.742574
Combined approximate root label accuracy: 0.722680
The fine-grained and positive/negative accuracy was quite different from the paper "Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y. and Potts, C., 2013, October. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (Vol. 1631, p. 1642)."
The paper reports higher fine-grained and positive/negative accuracy than I obtained.
(screenshot: the results table in the paper)
Was there a problem with my procedure? Why was my result different from the paper's?
The short answer is that the paper used a different system written in Matlab. The Java system does not match the paper. Though we do distribute the binary model we trained in Matlab with the English models jar. So you can RUN the binary model with Stanford CoreNLP, but you cannot TRAIN a binary model with similar performance with Stanford CoreNLP at this time.
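For completeness, a hedged sketch of running that distributed binary model from Python by shelling out to the documented SentimentPipeline entry point; it assumes the CoreNLP jars (including the English models jar) are in the current directory, and input.txt is a hypothetical file with one sentence per line:
import subprocess

# edu.stanford.nlp.sentiment.SentimentPipeline runs the pretrained sentiment model
# shipped with the English models jar; it prints a sentiment label per sentence
result = subprocess.run(
    ["java", "-cp", "*", "-mx8g",
     "edu.stanford.nlp.sentiment.SentimentPipeline",
     "-file", "input.txt"],  # input.txt is a hypothetical input file
    capture_output=True, text=True, check=True
)
print(result.stdout)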
