I have loaded a local file into a Talend process and need to apply the conditions below to this file's data.
My CSV file data looks like this:
NO,DATE,MARK
123,2015-03-01,200
123,2015-03-01,-200
123,2015-03-01,200
123,2015-03-01,200
125,2016-01-01,80
Above, the two values "200" and "-200" are both present. If I have a -200, I need to remove the corresponding +200 value. After that, if I have rows with the same NO, DATE, and MARK, I need to remove the duplicates:
"123,2015-03-01,200", "123,2015-03-01,200" = "123,2015-03-01,200"
Finally, my result should look like this:
NO,DATE,MARK
123,2015-03-01,200
125,2016-01-01,80
After that I need to sum 200 + 80 = 280, giving 125,2016-01-01,280. How can I do this process in a Talend job?
Step by step, we can start by removing this:
123,2015-03-01,200
123,2015-03-01,-200
We can do this by summing MARK after grouping by NO and DATE, using the Talend component tAggregateRow. Afterwards, we will get:
123,2015-03-01,0
Now we can use the component tFilterRow to remove all rows having MARK == 0, and the component tUniqRow to remove duplicated rows.
The last step is to get the sum of MARK using tAggregateRow and store it in a context variable, then get the greatest NO and the latest DATE using the component tSortRow, keep only that row using tSampleRow, and assign the sum of MARK to it.
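For reference, here is a minimal plain-Java sketch of the netting logic the question describes (cancel each -MARK against one matching +MARK for the same NO and DATE, deduplicate, then total), outside of Talend. The Row record and its field names are assumptions for illustration; records need Java 16+.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class CsvNettingSketch {
    // Hypothetical row holder mirroring the CSV columns NO,DATE,MARK
    record Row(String no, String date, int mark) {}

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>(List.of(
                new Row("123", "2015-03-01", 200),
                new Row("123", "2015-03-01", -200),
                new Row("123", "2015-03-01", 200),
                new Row("123", "2015-03-01", 200),
                new Row("125", "2016-01-01", 80)));

        // Step 1: each negative MARK cancels one matching positive MARK (same NO and DATE)
        for (int i = 0; i < rows.size(); i++) {
            Row r = rows.get(i);
            if (r.mark() < 0) {
                int j = rows.indexOf(new Row(r.no(), r.date(), -r.mark()));
                if (j >= 0) {
                    rows.remove(Math.max(i, j)); // remove the higher index first
                    rows.remove(Math.min(i, j)); // so the lower one stays valid
                    i = -1;                      // restart the scan after mutating the list
                }
            }
        }

        // Step 2: drop exact duplicates while keeping order (what tUniqRow does)
        List<Row> unique = new ArrayList<>(new LinkedHashSet<>(rows));
        unique.forEach(System.out::println); // 123/2015-03-01/200 and 125/2016-01-01/80

        // Step 3: total of the remaining MARK values (what tAggregateRow does)
        int total = unique.stream().mapToInt(Row::mark).sum();
        System.out.println("Total MARK = " + total); // 280
    }
}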
I'm trying to follow the tutorial on this page (https://www.bezkoder.com/spring-boot-upload-csv-file/) in order to insert information from a CSV file into a MySQL DB, but I got stuck on something in the class CSVHelper.
While investigating, I found the problem was located in TreeMap.getEntryUsingComparator(), where the key value doesn't match any of the values of the headerMap.
When I checked the variables in the Debug view, I saw the first values were different whereas the text was the same ("Id").
The key argument ("Id") has the value [73, 100]
The headersMap key ("Id") has the value [-1, -2, 73, 0, 100, 0]
I have checked the header in the file and there's no space. Otherwise, all the other headers work fine.
After changing the order of the headers, it became clear that the problem only affects the first header name: [-1, -2] is added at the beginning and 0 between the other values.
So, what do you think it could possibly be? What can I do to solve this?
Project on Github, branch dev-mysql-csv
This change at the beginning of the input was the consequence of the BOM (Byte Order Mark): the bytes [-1, -2] are 0xFF 0xFE, the UTF-16 little-endian BOM, and the interleaved zeros are the second byte of each UTF-16 code unit. The CSV file wasn't saved in the right format (I changed it from "CSV delimited with comma" to "CSV delimited with semicolon") and then it worked.
But this is OK only when the separator is "," and not ";", which is very odd...
In order to handle the BOM, there is BOMInputStream. With it, I succeeded in running "CSV delimited with comma".
I tried to use withRecordSeparator(";") and withDelimiter(";") in order to make it work with ";", but it failed.
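For anyone hitting the same issue, here is a minimal sketch of wrapping the upload stream in BOMInputStream (from Apache Commons IO) before handing it to Apache Commons CSV, along the lines of the bezkoder tutorial. The header name "Id" is taken from the question; everything else is illustrative.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import org.apache.commons.io.input.BOMInputStream;

public class CsvBomSketch {

    public static void parse(InputStream is) throws IOException {
        // BOMInputStream strips a leading UTF-8 BOM by default, so the first
        // header is read as "Id" rather than "\uFEFFId". For UTF-16 files,
        // pass the matching ByteOrderMark constants to the constructor and
        // decode with the matching charset instead of UTF-8.
        BOMInputStream bomIn = new BOMInputStream(is);
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(bomIn, StandardCharsets.UTF_8));
             CSVParser parser = new CSVParser(reader,
                     CSVFormat.DEFAULT.withFirstRecordAsHeader().withTrim())) {
            for (CSVRecord record : parser) {
                System.out.println(record.get("Id")); // header lookup now matches
            }
        }
    }
}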
I have a file with data in the first row that I want to extract. The data looks like:
20200403|AS421|||FINN|
public void handleLine(String line) {
    if (line.contains(firstJobConfig.DELIMITER_PIPE)) {
        headerInfo.setcreateDate(line.substring(0, line.indexOf(firstJobConfig.DELIMITER_PIPE)));
        headerInfo.setformName(line.substring(line.indexOf(firstJobConfig.DELIMITER_PIPE)));
    }
}
I have code that pulls 20200403 into my createDate variable, but I can't figure out how to get my formName set to AS421; right now it's set to |AS421|||FINN|. I know that if I do line.substring(9, 14) it will work, but I want to start after the first pipe delimiter (|) and stop at the next one.
Right now, you're doing this: headerInfo.setformName(line.substring(line.indexOf(firstJobConfig.DELIMITER_PIPE))) -> you're taking a substring that starts at the index of the first delimiter without specifying where the substring ends (that's why the result of the second substring is |AS421|||FINN|). A better way is to use line.split("\\|"); in your case it returns an array of 5 elements: ["20200403","AS421","","","FINN"]. Then you can do:
headerInfo.setcreateDate(table[0]);
headerInfo.setformName(table[1]);
You can split the strings like below.
Add a + to match one or more instances of the pipe:
temp.split("\\|+");
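As a quick sanity check, here is a self-contained sketch of both splits on the sample line from the question (variable names are illustrative):

public class SplitSketch {
    public static void main(String[] args) {
        String line = "20200403|AS421|||FINN|";
        // Plain split keeps the empty fields between consecutive pipes
        String[] table = line.split("\\|");    // ["20200403", "AS421", "", "", "FINN"]
        // "\\|+" treats a run of pipes as one separator, dropping the empty fields
        String[] compact = line.split("\\|+"); // ["20200403", "AS421", "FINN"]
        System.out.println(table[1]);   // AS421
        System.out.println(compact[2]); // FINN
    }
}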
OK, I am developing a Spring MVC based web application. The application shows data in a list, and I also provide filter options to enhance the search functionality. I remove extra spaces by using trim(). What is happening now: when the user inputs data in the text field and presses enter, the corresponding result is displayed in the list, but if a space is added after the input, the result is "NOT FOUND", even though I handle the space in JavaScript too.
Java code which fetches data from the database:
if (searchParamDTO.getRegNO().trim() != null && !searchParamDTO.getRegNO().trim().equals("") && !searchParamDTO.getRegNO().trim().equals("null")) {
    query += " AND UR.REG_UNIQUE_ID = :REG_UNIQUE_ID ";
    param.addValue("REG_UNIQUE_ID", searchParamDTO.getRegNO());
}
JavaScript code which fetches the value by id:
function setSearchParameters() {
    regNo = $('#regNo').val().trim();
}
I also attached two screenshots, with and without a trailing space.
Without space
With space
As @Greg H said, you're trimming the string when checking whether it's blank, but then adding the raw string to the query, which will include any trailing spaces.
So the line param.addValue("REG_UNIQUE_ID", searchParamDTO.getRegNO()); should be replaced by param.addValue("REG_UNIQUE_ID", searchParamDTO.getRegNO().trim());
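A slightly cleaner variant is to trim once into a local variable and bind that value, so the check and the bound parameter can never diverge. A sketch of the same block, also guarding against getRegNO() returning null (an assumption, since the original code would throw a NullPointerException in that case):

String regNo = searchParamDTO.getRegNO() == null ? null : searchParamDTO.getRegNO().trim();
if (regNo != null && !regNo.isEmpty() && !regNo.equals("null")) {
    query += " AND UR.REG_UNIQUE_ID = :REG_UNIQUE_ID ";
    param.addValue("REG_UNIQUE_ID", regNo); // the trimmed value is what gets bound
}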
I have:
a database table with 400 000 000 rows (Cassandra 3)
a list of circa 10 000 keywords
both data sets are expected to grow in time
I need to:
check if a specified column contains a keyword
count how many rows contain the keyword in that column
Which approach should I choose?
Approach 1 (Secondary index):
Create secondary SASI index on the table
Find matches for a given keyword "on the fly" at any time (see the sketch after this list)
However, I am afraid of:
capacity problems - secondary indices consume extra space, and for such a large table it could be too much
performance - I am not sure whether finding a keyword among hundreds of millions of rows can be done in a reasonable time
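A sketch of what Approach 1 might look like with the DataStax Java driver 3.x (which the question's PagingState/PagedResult types suggest); the contact point, keyspace, table, and column names are assumptions:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SasiSketch {
    public static void main(String[] args) {
        // Assumed names: keyspace "ks", table "rows", text column "title"
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks")) {
            // SASI index in CONTAINS mode, which enables LIKE '%keyword%' queries
            session.execute("CREATE CUSTOM INDEX IF NOT EXISTS title_sasi ON ks.rows (title) "
                    + "USING 'org.apache.cassandra.index.sasi.SASIIndex' "
                    + "WITH OPTIONS = {'mode': 'CONTAINS'}");
            // Count matches for one keyword "on the fly"
            long matches = session.execute(
                    "SELECT count(*) FROM ks.rows WHERE title LIKE ?", "%keyword%")
                    .one().getLong(0);
            System.out.println(matches);
        }
    }
}

Note that an unfiltered count(*) behind a LIKE predicate still touches a lot of data and can time out at this scale, which is exactly the performance concern above.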
Approach 2 (Java job - brute force):
Java job that continuously iterates over data
Matches are saved into cache
Cache is updated during the next iteration
// Paginate through data...
String page = null;
do {
    PagingState state = page == null ? null : PagingState.fromString(page);
    PagedResult<DataRow> res = getDataPaged(query, status, PAGE_SIZE, state);
    // Iterate through the current page...
    for (DataRow row : res.getResult()) {
        // Skip empty titles
        if (row.getTitle().length() == 0) {
            continue;
        }
        // Find a match in the title
        for (String k : keywords) {
            if (k.length() > row.getTitle().length()) {
                continue;
            }
            if (row.getTitle().toLowerCase().contains(k.toLowerCase())) {
                // TODO: save the match
                break;
            }
        }
    }
    status = res.getResult();
    page = res.getPage();
    // TODO: wait here to reduce DB load
} while (page != null);
Problems
It could be very slow to iterate through the whole table. If I waited one second per 1000 rows, the cycle would take about 4.6 days to finish.
It would require extra space for the cache; moreover, frequent deletions from the cache would produce tombstones in Cassandra.
A better way would be to use a search engine like Solr or Elasticsearch; full-text search is their specialty. You could easily dump your data from Cassandra into Elasticsearch and implement your Java job on top of Elasticsearch.
EDIT:
With Cassandra you can request your query results as JSON, and Elasticsearch 'speaks' only JSON, so you will be able to transfer your data very easily.
Elasticsearch
Solr
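To illustrate the suggestion, here is a minimal sketch that asks Elasticsearch how many documents match a keyword, using its _count endpoint over plain HTTP. The index name "rows" and field "title" are assumptions; java.net.http needs Java 11+.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsKeywordCount {
    public static void main(String[] args) throws Exception {
        String keyword = "example"; // one of the ~10 000 keywords
        String query = "{\"query\":{\"match\":{\"title\":\"" + keyword + "\"}}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/rows/_count"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"count":42,...}
    }
}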
I've gone through the related questions on this site but haven't found a relevant solution.
When querying my Solr4 index using an HTTP request of the form
&facet=true&facet.field=country
the response contains all the different countries along with counts per country.
How can I get this information using SolrJ?
I have tried the following but it only returns total counts across all countries, not per country:
solrQuery.setFacet(true);
solrQuery.addFacetField("country");
The following does seem to work, but I do not want to have to explicitly set all the groupings beforehand:
solrQuery.addFacetQuery("country:usa");
solrQuery.addFacetQuery("country:canada");
Secondly, I'm not sure how to extract the facet data from the QueryResponse object.
So two questions:
1) Using SolrJ how can I facet on a field and return the groupings without explicitly specifying the groups?
2) Using SolrJ how can I extract the facet data from the QueryResponse object?
Thanks.
Update:
I also tried something similar to Sergey's response (below).
List<FacetField> ffList = resp.getFacetFields();
log.info("size of ffList:" + ffList.size());
for (FacetField ff : ffList) {
    String ffname = ff.getName();
    int ffcount = ff.getValueCount();
    log.info("ffname:" + ffname + "|ffcount:" + ffcount);
}
The above code shows ffList with size = 1, and the loop goes through one iteration. In the output, ffname = "country" and ffcount is the total number of rows that match the original query.
There is no per-country breakdown here.
I should mention that on the same solrQuery object I am also calling addField and addFilterQuery. Not sure if this impacts faceting:
solrQuery.addField("user-name");
solrQuery.addField("user-bio");
solrQuery.addField("country");
solrQuery.addFilterQuery("user-bio:" + "(Apple OR Google OR Facebook)");
Update 2:
I think I got it, again based on what Sergey said below. I extracted the List object using FacetField.getValues().
List<FacetField> fflist = resp.getFacetFields();
for (FacetField ff : fflist) {
    String ffname = ff.getName();
    int ffcount = ff.getValueCount();
    List<Count> counts = ff.getValues();
    for (Count c : counts) {
        String facetLabel = c.getName();
        long facetCount = c.getCount();
    }
}
In the above code, facetLabel matches each facet group and facetCount is the corresponding count for that grouping.
Actually, you only need to set the facet field and faceting will be activated (check the SolrJ source code):
solrQuery.addFacetField("country");
Where did you look for the facet information? It must be in QueryResponse.getFacetFields() (getValues().getCount()).
In the Solr response you should use QueryResponse.getFacetFields() to get the List of FacetFields, among which "country" figures; "country" is thus identified by QueryResponse.getFacetFields().get(0).
You then iterate over it to get the List of Count objects using
QueryResponse.getFacetFields().get(0).getValues().get(i)
and get the name of each facet value using QueryResponse.getFacetFields().get(0).getValues().get(i).getName()
and the corresponding count using
QueryResponse.getFacetFields().get(0).getValues().get(i).getCount()
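Putting both answers together, a minimal end-to-end sketch (the Solr URL and core name are assumptions; HttpSolrServer is the Solr 4-era SolrJ client):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.FacetField.Count;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CountryFacetSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);           // implied by addFacetField, kept for clarity
        q.addFacetField("country");
        q.setRows(0);               // only the facet counts are needed, not the documents
        QueryResponse resp = solr.query(q);
        // One FacetField per facet.field requested; its values are the per-country buckets
        FacetField country = resp.getFacetField("country");
        for (Count c : country.getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
    }
}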