Scala - how to format a collection into a String? - java

I'm trying to parse Metrics data into a formatted String so that there is a header and each record below starts from a new line. Initially I wanted to get something close to a table formatting like this:
Id | Name | Rate | Value
1L | Name1 | 1 | value_1
2L | Name2 | 2 | value_2
3L | Name3 | 3 | value_3
But my current implementation results in the following Error:
java.util.MissingFormatArgumentException: Format specifier '%-70s'
What should I change in my code to get it formatted correctly?
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
case class BaseMetric(val id: Long,
                      val name: String,
                      val rate: String,
                      val value: String,
                      val count: Long,
                      val isValid: Boolean
                     ) {
  def makeCustomMetric: String = Seq(id, name, rate, value).mkString("\t")
}
val metric1 = new BaseMetric(1L, "Name1", "1", "value_1", 10L, true)
val metric2 = new BaseMetric(2L, "Name2", "2", "value_2", 20L, false)
val metric3 = new BaseMetric(3L, "Name3", "3", "value_3", 30L, true)
val metrics = Seq(metric1, metric2, metric3)
def formatMetrics(metrics: Seq[BaseMetric]): String = {
  val pattern = "%-50s | %-70s | %-55s | %-65s | %f"
  val formattedMetrics: String = pattern.format(metrics.map(_.makeCustomMetric))
    .mkString("Id | Name | Rate | Value\n", "\n", "\nId | Name | Rate | Value")
  formattedMetrics
}
val metricsString = formatMetrics(metrics)

The specific error occurs because you pass a Seq[String] to format, which expects Any*. You pass only one argument instead of five, so format can't find anything to match the second format specifier ('%-70s').
You want to apply the pattern to every metric, not all the metrics to the pattern.
The padding widths in the format string are also far larger than the table you want to produce.
val pattern = "%-2s | %-5s | %-4s | %-6s"
metrics.map(m => pattern.format(m.makeCustomMetric: _*))
.mkString("Id | Name | Rate | Value\n", "\n", "\nId | Name | Rate | Value")
The _* tells the compiler that you want to pass a sequence as a variable-length argument list.
makeCustomMetric should then return the Seq itself instead of a String:
def makeCustomMetric: Seq[Any] = Seq(id, name, rate, value)
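Putting the pieces together, here is a minimal sketch of the corrected version (the padding widths are only illustrative; adjust them to your data):

case class BaseMetric(id: Long, name: String, rate: String, value: String,
                      count: Long, isValid: Boolean) {
  // return the fields as a Seq so they can be expanded into format's varargs
  def makeCustomMetric: Seq[Any] = Seq(id, name, rate, value)
}

def formatMetrics(metrics: Seq[BaseMetric]): String = {
  val pattern = "%-2s | %-5s | %-4s | %-6s"
  metrics
    .map(m => pattern.format(m.makeCustomMetric: _*)) // one format call per metric
    .mkString("Id | Name | Rate | Value\n", "\n", "")
}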

Scala string interpolation is an idiomatic way to concatenate and format strings.
Reference: https://docs.scala-lang.org/overviews/core/string-interpolation.html
s"id: $id, name: $name, rate: $rate, value: $value, count: $count, isValid: $isValid"

Related

How to parse a column that has a custom json format from a spark DataFrame

I have a spark data frame containing a json column, formatted differently from the standard:
|col_name |
|{a=6236.0, b=0.0} |
|{a=323, b=2.3} |
As you can see the json contains the = sign for the fields instead of :
If I use the predefined function from_json this will yield null as the column doesn't have the standard format. Is there another way to parse this column into two separate columns?
I don't see a simple way to parse this input directly. You need to break the string apart and construct valid JSON using a udf. Check this out:
scala> val df = Seq(("{a=6236.0, b=0.0}"),("{a=323, b=2.3} ")).toDF("data")
df: org.apache.spark.sql.DataFrame = [data: string]
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val sch1 = new StructType().add($"a".string).add($"b".string)
sch1: org.apache.spark.sql.types.StructType = StructType(StructField(a,StringType,true), StructField(b,StringType,true))
scala> def json1(x:String):String=
| {
| val coly = x.replaceAll("[{}]","").split(",")
| val cola = coly(0).trim.split("=")
| val colb = coly(1).trim.split("=")
| "{\""+cola(0)+"\":"+cola(1)+ "," + "\"" +colb(0) + "\":" + colb(1) + "}"
| }
json1: (x: String)String
scala> val my_udf = udf( json1(_:String):String )
my_udf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.withColumn("n1",my_udf('data)).select(from_json($"n1",sch1) as "data").select("data.*").show(false)
+------+---+
|a |b |
+------+---+
|6236.0|0.0|
|323 |2.3|
+------+---+

Elasticsearch - how to group by and count matches in an index

I have an instance of Elasticsearch running with thousands of documents. My index has 2 fields like this:
| Type    | Date_added              |
| walking | 2018-11-27T00:00:00.000 |
| walking | 2018-11-26T00:00:00.000 |
| running | 2018-11-24T00:00:00.000 |
| running | 2018-11-25T00:00:00.000 |
| walking | 2018-11-27T04:00:00.000 |
I want to group by and count how many matches were found for the "type" field, in a certain range.
In SQL I would do something like this:
select type,
count(type)
from index
where date_added between '2018-11-20' and '2018-11-30'
group by type
I want to get something like this:
| type | count |
| running | 2 |
| walking | 3 |
I'm using the High Level REST Client API in my project. So far my query looks like this; it only filters by the start and end time:
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(QueryBuilders
    .boolQuery()
    .must(QueryBuilders
        .rangeQuery("date_added")
        .from(start.getTime())
        .to(end.getTime()))
);
How can I do a "group by" in the "type" field? Is it possible to do this in ElasticSearch?
That's a good start! Now you need to add a terms aggregation to your query:
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(QueryBuilders.boolQuery()
    .must(QueryBuilders
        .rangeQuery("date_added")
        .from(start.getTime())
        .to(end.getTime()))
);
// add these two lines
TermsAggregationBuilder groupBy = AggregationBuilders.terms("byType").field("type.keyword");
sourceBuilder.aggregation(groupBy);
After using Val's reply to aggregate the fields, I wanted to print each aggregation bucket of my query together with its count. Here's what I did:
Terms terms = searchResponse.getAggregations().get("byType");
Collection<Terms.Bucket> buckets = (Collection<Bucket>) terms.getBuckets();
for (Bucket bucket : buckets) {
    System.out.println("Type: " + bucket.getKeyAsString() + " = Count(" + bucket.getDocCount() + ")");
}
This is the output after running the query in an index with 2700 documents with a field called "type" and 2 different types:
Type: walking = Count(900)
Type: running = Count(1800)

How to create a new column in a Spark DataFrame based on a second DataFrame (Java)?

I have two Spark DataFrames. One of them has two columns, id and Tag; the second has an id column but is missing the Tag. The first DataFrame is essentially a dictionary: each id appears once, while in the second DataFrame an id may appear several times. What I need is to create a new column in the second DataFrame that has the Tag as a function of the id in each row. I think this could be done by converting to RDDs first, etc., but I thought there must be a more elegant way using DataFrames (in Java).
Example: given a df1 row -> id: 0, Tag: "A", a df2 row1 -> id: 0, Tag: null, and a df2 row2 -> id: 0, Tag: "B", I need to create a Tag column in the resulting DataFrame df3 equal to df1(id=0) = "A" if the df2 Tag was null, but keeping the original Tag if it was not null, resulting in df3 row1 -> id: 0, Tag: "A" and df3 row2 -> id: 0, Tag: "B". Hope the example is clear.
| ID | No. | Tag | new Tag Col |
| 1 | 10002 | A | A |
| 2 | 10003 | B | B |
| 1 | 10004 | null | A |
| 2 | 10005 | null | B |
All you need here is a left outer join and coalesce:
import org.apache.spark.sql.functions.coalesce

val df = sc.parallelize(Seq(
  (1, 10002, Some("A")), (2, 10003, Some("B")),
  (1, 10004, None), (2, 10005, None)
)).toDF("id", "no", "tag")

val lookup = sc.parallelize(Seq(
  (1, "A"), (2, "B")
)).toDF("id", "tag")

df.join(lookup, df.col("id").equalTo(lookup.col("id")), "leftouter")
  .withColumn("new_tag", coalesce(df.col("tag"), lookup.col("tag")))
This should be almost identical to the Java version.
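For completeness, a small follow-up sketch of the same join that keeps only the original columns plus the resolved tag (same column names as above; row order in show() may differ):

df.join(lookup, df.col("id").equalTo(lookup.col("id")), "leftouter")
  .select(df.col("id"), df.col("no"),
          coalesce(df.col("tag"), lookup.col("tag")).alias("tag"))
  .show()

// +---+-----+---+
// | id|   no|tag|
// +---+-----+---+
// |  1|10002|  A|
// |  2|10003|  B|
// |  1|10004|  A|
// |  2|10005|  B|
// +---+-----+---+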

Regular Expression in Java for UMASK

I need a Java regular expression to get the following two values:
the value of the UMASK parameter in file /etc/default/security should be set to 077. [Current value: 022] [AA.1.9.3]
the value of UMASK should be set to 077 in /etc/skel/.profile [AA.1.9.3]
I need to get the file name from the input string, as well as the current value if existing.
I wrote a regex .* (.*?/.*?) (?:\\[Current value\\: (\\d+)\\])?.* for this. It can match both strings and capture the file name, but it can NOT get the current value.
Then another regex: .* (.*?/.*?) (?:\\[Current value\\: (\\d+)\\])? .* — compared with the first one there is a space before the last .*. It can match string 1 and capture both the file name and the current value, but it can NOT match string 2.
How can I correct these regular expressions to obtain the values described above?
If I understand your requirements correctly (file name and current octal permissions value), you can use the following Pattern:
String input =
"Value for parameter UMASK in file /etc/default/security should be set to 077. " +
"[Current value: 022] [AA.1.9.3] - " +
"Value of UMASK should be set to 077 in /etc/skel/.profile [AA.1.9.3]";
// Pattern breakdown:
//   "file "            literal "file " marker before the path
//   (.+?)              group 1: the file path
//   " "                space after the path
//   .+?                any characters (lazy)
//   \[Current value:   escaped square bracket + "Current value: " marker
//   (\d+)              group 2: digits for the current value
//   \]                 closing bracket
Pattern p = Pattern.compile("file (.+?) .+?\\[Current value: (\\d+)\\]");
Matcher m = p.matcher(input);
// iterates, but will find only once in this instance (which is desirable)
while (m.find()) {
System.out.printf("File: %s%nCurrent value: %s%n", m.group(1), m.group(2));
}
Output
File: /etc/default/security
Current value: 022

Removing null elements and keeping non-null elements together on a list in jasper reports

I am using JRBeanCollectionDataSource as the datasource for a subreport. Each record in the list contains elements with either a null or non-null value. This is my POJO:
public class PayslipDtl {
    private String earningSalaryHeadName;
    private double earningSalaryHeadAmount;
    private String deductionSalaryHeadName;
    private double deductionSalaryHeadAmount;
    String type;

    public PayslipDtl(String salaryHeadName,
            double salaryHeadAmount, String type) {
        if (type.equalsIgnoreCase("Earning")) {
            earningSalaryHeadName = salaryHeadName;
            earningSalaryHeadAmount = salaryHeadAmount;
        } else {
            deductionSalaryHeadName = salaryHeadName;
            deductionSalaryHeadAmount = salaryHeadAmount;
        }
    }
    //getters and setters
}
Based on the "type", the list is populated as such: {"Basic", 4755, null, 0.0}, {"HRA", 300, null, 0.0}, {null, 0.0, "Employee PF", 925}, {"Medical Allowance", 900, null, 0.0} and so on...
After setting isBlankWhenNull to true and using "Print when" expression, the record is displayed as such:
|Earning |Amount|Deduction |Amount|
--------------------|------|---------------------|------|
| Basic | 4755 | | |
| HRA | 300 | | |
| | | Employee PF | 925 |
| Medical Allowance | 900 | | |
| Fuel Reimbursement| 350 | | |
| | | Loan | 1000 |
---------------------------------------------------------
I want it to be displayed as such:
|Earning |Amount|Deduction |Amount|
--------------------|------|---------------------|------|
| Basic | 4755 | Employee PF | 925 |
| HRA | 300 | Loan | 1000 |
| Medical Allowance | 900 | | |
| Fuel Reimbursement| 350 | | |
---------------------------------------------------------
Setting isRemoveLineWhenBlank to true doesn't work, since it is not the entire row that is blank; only a subset of the elements in each row is null.
Is it possible in Jasper?
I am using iReport Designer 5.0.1 with compatibility set to JasperReports3.5.1.
Use a List component for the deduction/amount; here you have a video tutorial on how to do this.
Then the deduction and amount fields in the list component need the following options: Blank when null and Remove line when blank.
If this still gives you blank lines, try putting both fields on a frame inside the list and mark those options for the frame too.
The only good solution is to create a separate table, like this:
table employeeED:
    srno int,
    Earning varchar(50),
    EarnAmount Double,
    Deduction varchar(50),
    DedAmount Double
Then you have to insert all earnings on the earning side and update all deductions on the deduction side.
// assumes a scrollable ResultSet rs over the payslip rows and a Statement stmt on the target database
int i = 1;
rs.beforeFirst();
while (rs.next()) {
    if (rs.getString("type").equalsIgnoreCase("Earning")) {
        stmt.executeUpdate("insert into employeeED (srno, Earning, EarnAmount) values ("
                + i + ", '" + rs.getString("earning") + "', " + rs.getDouble("eamt") + ")");
        i++;
    }
}
int j = 1;
rs.beforeFirst();
while (rs.next()) {
    if (rs.getString("type").equalsIgnoreCase("Deduction")) {
        stmt.executeUpdate("update employeeED set Deduction = '" + rs.getString("earning")
                + "', DedAmount = " + rs.getDouble("eamt") + " where srno = " + j);
        j++;
    }
}
Then use the employeeED table as the datasource.
100% working.
