How does Spark in Java filter the values in a list in a Dataset? - java

I have two classes. One is NewsArticle: String id, String title, List contents; the other is ContentItem: String content, String subtype, String url.
I want to keep only the content whose subtype value equals "paragraph", spliced together into one long string (the url is not needed).
Here is what the NewsArticle Dataset looks like:
1, "Title", [{htt..., paragraph, rem...},{htt..., paragraph, rem...},{htt..., paragraph, rem...}]
which is id, title, List<ContentItem>
I took out the contents column, so each single row is one article, like this:
[{http..., others, con...},{http..., paragraph, rem...},{http..., paragraph, rem...}]
which is url, subtype, content
Now I want to make each article (row) look like:
1, "Title", "this is content which subtype equals paragraph"
Can anyone help me with Java?

This would work:
df
.withColumn("newContent", functions.explode(functions.col("items")))
.filter("newContent.subtype=='paragraph'")
.selectExpr("id", "title", "newContent.content as content")
.show(false);
Input:
+---+--------------------------------------------------------------------------------------------------------+-----+
|id |items |title|
+---+--------------------------------------------------------------------------------------------------------+-----+
|id |[[Content1, subtype1, someurl], [ContentOfParagraph, paragraph, someurl], [Content2, subtype2, someurl]]|Title|
+---+--------------------------------------------------------------------------------------------------------+-----+
Output:
+---+-----+------------------+
|id |title|content |
+---+-----+------------------+
|id |Title|ContentOfParagraph|
+---+-----+------------------+
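Note that the explode approach yields one output row per matching item; to get a single concatenated string per article (as the question asks), you would additionally group by id. Outside Spark, the same filter-and-splice logic can be sketched with plain Java streams (a minimal sketch; the ContentItem class here just mirrors the one from the question):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ParagraphJoiner {
    // Mirrors the ContentItem class from the question
    static class ContentItem {
        final String url, subtype, content;
        ContentItem(String url, String subtype, String content) {
            this.url = url; this.subtype = subtype; this.content = content;
        }
    }

    // Keep only "paragraph" items and splice their contents into one string
    static String joinParagraphs(List<ContentItem> items) {
        return items.stream()
                .filter(i -> "paragraph".equals(i.subtype))
                .map(i -> i.content)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<ContentItem> items = List.of(
                new ContentItem("http://a", "others", "ignored"),
                new ContentItem("http://b", "paragraph", "first part"),
                new ContentItem("http://c", "paragraph", "second part"));
        System.out.println(joinParagraphs(items)); // first part second part
    }
}
```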

Related

Nullifying elements that don't match a condition in a list using streams

I have a POJO like so:
class Val {
private String name;
private String value;
}
Now I've got a List<Val> values. I'm trying to obtain all the elements from values that have the name "ABC", such that the elements in the list that don't have the matching name become null.
For example:
List<Val> values = [Val(name=ABC,value=val1), Val(name=DEF,value=val2), Val(name=ABC,value=val3), Val(name=ABC,value=val4)]
values.stream().filter(x -> x.getName().equals("ABC")).collect(Collectors.toList()); //this will filter out the matching elements. This I know.
//Expected Output. Not sure how to do this.
List<Val> valuesOut = [Val(name=ABC,value=val1), null, Val(name=ABC,value=val3), Val(name=ABC,value=val4)]
The size of both the input and output lists remains the same. Only the elements with names other than ABC are turned to null.
Any suggestions on how to do this using streams?
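One way to do this (a sketch, assuming Val has the usual getters) is to use map instead of filter, so every input element produces exactly one output element: either itself or null. Note that Collectors.toList() happily collects nulls:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NullifyDemo {
    static class Val {
        private final String name;
        private final String value;
        Val(String name, String value) { this.name = name; this.value = value; }
        String getName() { return name; }
        String getValue() { return value; }
    }

    // map keeps the list size: matching elements pass through, others become null
    static List<Val> nullifyNonMatching(List<Val> values, String name) {
        return values.stream()
                .map(x -> name.equals(x.getName()) ? x : null)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Val> values = Arrays.asList(
                new Val("ABC", "val1"), new Val("DEF", "val2"),
                new Val("ABC", "val3"), new Val("ABC", "val4"));
        List<Val> out = nullifyNonMatching(values, "ABC");
        System.out.println(out.get(1)); // null
    }
}
```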

Hibernate-search search by list of numbers

I am working in a Hibernate-search, Java application with an entity which has a numeric field indexed:
@Field
@NumericField
private Long orgId;
I want to get the list of entities which match a list of Long values for this property. I used "simpleQueryString" because it allows "OR" logic with the char | for several target values. I have something like this:
queryBuilder.simpleQueryString().onField("orgId").matching("1|3|8").createQuery()
After running my application I get:
The specified query '+(orgId:1 orgId:3 orgId:8)' contains a string based sub query which targets the numeric encoded field(s) 'orgId'. Check your query or try limiting the targeted entities.
So, can somebody tell me what is wrong with this code? Is there another way to do what I need?
=================================
UPDATE 1:
yrodiere's answer solves the issue, but I have another doubt. I want to validate whether entities match other fields as well. I know I can use BooleanJunction, but then I need to mix "must" and "should" usages, right? I.e.:
BooleanJunction<?> bool = queryBuilder.bool();
for (Integer orgId : orgIds) {
    bool.should( queryBuilder.keyword().onField("orgId").matching(orgId).createQuery() );
}
bool.must( queryBuilder.keyword().onField("name").matching("anyName").createQuery() );
Then I am validating that the entities must match a "name" and also match one of the given orgIds. Am I right?
As the error message says:
The specified query [...] contains a string based sub query which targets the numeric encoded field(s) 'orgId'.
simpleQueryString can only be used to target text fields. Numeric fields are not supported.
If your string was generated programmatically, and you have a list of integers, this is what you'll need to do:
List<Integer> orgIds = Arrays.asList(1, 3, 8);
BooleanJunction<?> bool = queryBuilder.bool();
for (Integer orgId : orgIds) {
    bool.should( queryBuilder.keyword().onField("orgId").matching(orgId).createQuery() );
}
LuceneQuery query = bool.createQuery();
query will match documents whose orgId field contains 1, 3 OR 8.
See https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#_combining_queries
EDIT: If you need additional clauses, I'd recommend not mixing must and should in the same boolean junction, but nesting boolean junctions instead.
For example:
BooleanJunction<?> boolForOrgIds = queryBuilder.bool();
for (Integer orgId : orgIds) {
    boolForOrgIds.should(queryBuilder.keyword().onField("orgId").matching(orgId).createQuery());
}
BooleanJunction<?> boolForWholeQuery = queryBuilder.bool();
boolForWholeQuery.must(boolForOrgIds.createQuery());
boolForWholeQuery.must(queryBuilder.keyword().onField("name").matching("anyName").createQuery());
// and add as many "must" as you need
LuceneQuery query = boolForWholeQuery.createQuery();
Technically you can mix 'must' and 'should', but the effect won't be what you expect: 'should' clauses will become optional and will only raise the score of documents when they match. So, not what you need here.
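In plain boolean terms, the nested junction evaluates to (orgId == 1 OR orgId == 3 OR orgId == 8) AND name == "anyName". This is not Hibernate Search code, just a plain-Java sketch using predicates to make the must/should composition concrete (the Entity class and field names are assumptions for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class BoolJunctionSketch {
    // Hypothetical entity mirroring the indexed fields from the question
    static class Entity {
        final long orgId;
        final String name;
        Entity(long orgId, String name) { this.orgId = orgId; this.name = name; }
    }

    static Predicate<Entity> matches(List<Long> orgIds, String name) {
        // "should" clauses inside one junction combine like OR...
        Predicate<Entity> anyOrgId = e -> orgIds.contains(e.orgId);
        // ...and "must" clauses at the outer level combine like AND
        Predicate<Entity> nameMatches = e -> e.name.equals(name);
        return anyOrgId.and(nameMatches);
    }

    public static void main(String[] args) {
        Predicate<Entity> p = matches(Arrays.asList(1L, 3L, 8L), "anyName");
        System.out.println(p.test(new Entity(3L, "anyName"))); // true
        System.out.println(p.test(new Entity(3L, "other")));   // false
        System.out.println(p.test(new Entity(5L, "anyName"))); // false
    }
}
```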

The best way to keep a select result in an array

I have a question about keeping a query result in an array. For example, I execute a query
SELECT * FROM some_table
Then I want to save it to an array and create records. The table contains these columns:
id
user_name
last_name
The result array can be:
[[1, "First user name", "First last name"],
[2, "Second user name", "Second last name"]
...
].
Can you recommend which array or data type I should use?
You do that like this:
Create a bean class for User
public class User {
    private int id;
    private String firstName;
    private String lastName;
    // getters and setters
    ...
}
Then query all the data from the table and, for each row, create a User object and set the data.
List<User> users = new ArrayList<>();
while (rs.next()) { // rs is the ResultSet returned by your query
    User user = new User();
    user.setId(rs.getInt("id"));
    user.setFirstName(rs.getString("user_name"));
    user.setLastName(rs.getString("last_name"));
    users.add(user);
}
// afterwards, do what you want with the list
I would extend your question to "the best way to keep a select result" (with or without an array).
It depends on:
how many results
how many fields
what you want to do afterwards
whether you want to modify the data and write it back to your database
So, several propositions:
just arrays: String[] fields1; String[] fields2; ...
an array of arrays: String[][]
richer collections: Vector, List or Set (do you want them sorted? how will you retrieve them later?), or a Map if you want to keep an index => data association
or an Object you create yourself. For this, you even have tools for object-database mapping.
You should take a look at these options and decide based on what you want to do.
Hope it helps.

java key pressed event off by one letter

This is the method I'm using to test this:
private void searchFieldKeyTyped(java.awt.event.KeyEvent evt) {
String query = searchField.getText();
System.out.println(query);
}
If I type one letter, though, query contains an empty string.
If I type another letter, query contains only the previous letter.
So if I type "a", query is empty.
If I type "ab", query contains "a".
If I type "abc", query contains "ab".
If I type "abcd", query contains "abc".
And so on.
As discussed in the comments, use KEY_RELEASED rather than KEY_PRESSED: the KEY_PRESSED and KEY_TYPED events are delivered before the character is inserted into the text field, so getText() still returns the previous content.
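An alternative worth knowing (a sketch, not from the discussion above): listen on the field's Document instead of on key events. A DocumentListener fires after the text has actually changed, which avoids the off-by-one entirely and also reacts to paste and programmatic changes. The same listener can be attached via searchField.getDocument(); here it is demonstrated on a bare PlainDocument, which is what a JTextField uses internally:

```java
import java.util.ArrayList;
import java.util.List;
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;
import javax.swing.text.BadLocationException;
import javax.swing.text.PlainDocument;

public class DocumentListenerDemo {
    // Inserts "a", "b", "c" and returns the full text seen after each change
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        PlainDocument doc = new PlainDocument(); // a JTextField stores its text in a Document
        doc.addDocumentListener(new DocumentListener() {
            private void record(DocumentEvent e) {
                try {
                    // Unlike KEY_TYPED/KEY_PRESSED, this runs after the change was applied
                    seen.add(e.getDocument().getText(0, e.getDocument().getLength()));
                } catch (BadLocationException ex) {
                    throw new RuntimeException(ex);
                }
            }
            @Override public void insertUpdate(DocumentEvent e) { record(e); }
            @Override public void removeUpdate(DocumentEvent e) { record(e); }
            @Override public void changedUpdate(DocumentEvent e) { record(e); }
        });
        try {
            // Simulate typing "a", then "b", then "c"
            doc.insertString(doc.getLength(), "a", null);
            doc.insertString(doc.getLength(), "b", null);
            doc.insertString(doc.getLength(), "c", null);
        } catch (BadLocationException ex) {
            throw new RuntimeException(ex);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [a, ab, abc]
    }
}
```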

A strategy for parsing a tab-separated file

What would be the most primitive way of parsing a tab-separated file in Java, so that the tabular data does not lose its structure? I am not looking for a way to do it with Bean or Jsoup, since they are not familiar to me, a beginner. I need advice on what the logic behind it would be and what an efficient way to do it would look like. For example, if I have a table like
ID reference | Identifier | Type 1| Type 2 | Type 3 |
1 | red#01 | 15% | 20% | 10% |
2 | yellow#08 | 13% | 20% | 10% |
Correction: in this example I have Types 1-3, but my question applies to any number N of types.
Can I achieve table parsing by just using arrays, or is there a different data structure in Java that would be better for this task? This is how I think I should do it:
Scan/read the first line, splitting at "\t", and create a String array.
Split that array into sub-arrays, one table heading per sub-array.
Then start reading the next line of the table and, for each sub-array, add the corresponding values from the columns.
Does this plan sound right, or am I overcomplicating things/being completely wrong? Is there an easier way to do it? (Provided that I still don't know how to split arrays into subarrays and how to populate the subarrays with the values from the table.)
I would strongly suggest you use a flat file parsing library for this, like the excellent OpenCSV.
Failing that, here is a solution in Java 8.
First, create a class to represent your data:
static class Bean {
    private final int id;
    private final String name;
    private final List<Integer> types;

    public Bean(int id, String name, List<Integer> types) {
        this.id = id;
        this.name = name;
        this.types = types;
    }
    //getters
}
Your suggestion to use various lists is very scripting-based. Java is OO, so you should use that to your advantage.
Now we just need to parse the file:
public static void main(final String[] args) throws Exception {
    final Path path = Paths.get("path", "to", "file.tsv");
    final List<Bean> parsed;
    try (final Stream<String> lines = Files.lines(path)) {
        parsed = lines.skip(1).map(line -> line.split("\\s*\\|\\s*")).map(line -> {
            final int id = Integer.parseInt(line[0]);
            final String name = line[1];
            final List<Integer> types = Arrays.stream(line).
                    skip(2).map(t -> Integer.parseInt(t.replaceAll("\\D", ""))).
                    collect(Collectors.toList());
            return new Bean(id, name, types);
        }).collect(Collectors.toList());
    }
}
In essence the code skips the first line then loops over lines in the file and for each line:
Split the line on the delimiter, which seems to be |. This requires a regex, so you need to escape the pipe, as it is a special character. We also consume any spaces before/after the delimiter.
Create a new Bean for each line by parsing the array elements.
First, parse the id to an int.
Next, get the name.
Finally, get a Stream of the array elements, skip the first two, and parse the rest into a List<Integer>.
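The same parsing logic can be exercised on in-memory lines (a self-contained sketch using the sample table from the question, with a trimmed-down Bean inlined; swap the List for Files.lines(path) to read a real file):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TsvParseDemo {
    static class Bean {
        final int id;
        final String name;
        final List<Integer> types;
        Bean(int id, String name, List<Integer> types) {
            this.id = id; this.name = name; this.types = types;
        }
    }

    static List<Bean> parse(List<String> lines) {
        // Skip the header, split each row on "|" (consuming surrounding spaces),
        // then map the fields to a Bean: id, name, and the remaining numeric types
        return lines.stream().skip(1)
                .map(line -> line.split("\\s*\\|\\s*"))
                .map(f -> new Bean(
                        Integer.parseInt(f[0]),
                        f[1],
                        Arrays.stream(f).skip(2)
                                .map(t -> Integer.parseInt(t.replaceAll("\\D", "")))
                                .collect(Collectors.toList())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Bean> beans = parse(Arrays.asList(
                "ID reference | Identifier | Type 1| Type 2 | Type 3 |",
                "1 | red#01 | 15% | 20% | 10% |",
                "2 | yellow#08 | 13% | 20% | 10% |"));
        System.out.println(beans.get(0).name);  // red#01
        System.out.println(beans.get(0).types); // [15, 20, 10]
    }
}
```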
I would suggest to use Apache Commons CSV package, like described on the homepage: http://commons.apache.org/proper/commons-csv/
I'd use Guava's Splitter and Table:
https://code.google.com/p/guava-libraries/wiki/StringsExplained#Splitter
https://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained#Table
