How to remove duplicate columns after a JOIN in Pig?

Let's say I JOIN two relations like:
-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9
-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3
jnd = JOIN part BY $0, original BY $0;
The output will be:
1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3
Notice that the join key ($0 of each relation) is shown twice in each tuple. E.g. in
1,5.3,1,Anju,3.6,IT,A,1.6,0.3
the first and third fields are both the key.
I can remove the duplicate key manually by doing:
jnd = foreach jnd generate $0,$1,$3,$4 ..;
Is there a way to remove it dynamically, i.e. without listing the remaining fields by position?

I have faced the same kind of issue while joining data sets and in other data-processing work where column names get repeated in the output.
So I wrote a UDF which removes the duplicate columns by using the schema name of each field, retaining the data of the first occurrence of each unique column.
Prerequisites:
All fields must have names in the schema.
You need to download the UDF file and build it into a jar in order to use it.
UDF file location on GitHub:
GitHub UDF Java File Location
We will take the question above as an example.
--Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9
--Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3
PIG Script:
REGISTER /home/user/
DSA = LOAD '/home/user/DSALOC' USING PigStorage(',') AS (ROLLNO:int,CGPA:float);
DSB = LOAD '/home/user/DSBLOC' USING PigStorage(',') AS (ROLLNO:int,NAME:chararray,SUB1:float,BRANCH:chararray,GRADE:chararray,SUB2:float);
JOINOP = JOIN DSA BY ROLLNO,DSB BY ROLLNO;
We will get the column names after joining as:
DSA::ROLLNO:int,DSA::CGPA:float,DSB::ROLLNO:int,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float
To turn this into:
DSA::ROLLNO:int,DSA::CGPA:float,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float
i.e. with DSB::ROLLNO:int removed, we use the UDF as:
JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));
where org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.
The UDF removes duplicate columns by comparing the field names in the schema, so DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the data set.
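For reference, here is a minimal sketch of how such a UDF can be written. This is an illustration of the technique, not the actual GitHub implementation; the class name is made up, and it assumes the input schema is available via getInputSchema() (Pig 0.9+):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class RemoveDuplicateColumns extends EvalFunc<Tuple> {
    private static final TupleFactory TF = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null) return null;
        Schema in = getInputSchema();           // field aliases look like "DSA::ROLLNO"
        Set<String> seen = new HashSet<String>();
        List<Object> kept = new ArrayList<Object>();
        for (int i = 0; i < input.size(); i++) {
            String alias = in.getField(i).alias;
            int sep = (alias == null) ? -1 : alias.lastIndexOf("::");
            String bare = (sep >= 0) ? alias.substring(sep + 2) : alias;
            if (seen.add(bare)) {               // keep only the first occurrence of each name
                kept.add(input.get(i));
            }
        }
        return TF.newTuple(kept);
    }
}

A production version would also override outputSchema() so that downstream operators keep the de-duplicated field names.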

Related

Update specific location of file separated by specific character in JAVA

I am trying to build a text-file based database system, and I'm stuck. What I want to do is specify the location of a piece of data and then update it. I separate the fields with the "|" character.
Structure like this:
ID |Name |Job |Phone Number
---+-----+--------+------------
55 |John |Plumber |555444
The id part is used to find the row, and the name part identifies the column.
I want to write a function like this:
data_Update(filename, id, "Name", "Bob Ross");
You could do it in the following manner:
Read the file and, for each line of text, add an entry to a HashMap:
Map<Integer, Map<String, Object>> personMap
where the key represents the id of the person, and the value maps field names to field values for that entry.
In your data_Update method, locate the person by id and update the field, e.g.
personMap.get(id).put(fieldName, value);
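A minimal, self-contained sketch of this approach (column names are assumed from the example table; header and separator lines are skipped on read and not written back):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TxtDb {
    // Column names assumed from the example table.
    static final String[] COLUMNS = {"ID", "Name", "Job", "Phone Number"};

    static Map<Integer, Map<String, String>> load(Path file) throws IOException {
        Map<Integer, Map<String, String>> personMap = new LinkedHashMap<>();
        for (String line : Files.readAllLines(file)) {
            String t = line.trim();
            if (t.isEmpty() || !Character.isDigit(t.charAt(0))) continue; // skip header/separator
            String[] parts = t.split("\\|");    // '|' must be escaped in a regex
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < COLUMNS.length && i < parts.length; i++) {
                row.put(COLUMNS[i], parts[i].trim());
            }
            personMap.put(Integer.valueOf(row.get("ID")), row);
        }
        return personMap;
    }

    static void dataUpdate(Path file, int id, String field, String value) throws IOException {
        Map<Integer, Map<String, String>> personMap = load(file);
        personMap.get(id).put(field, value);    // locate the row by id, update one column
        List<String> out = new ArrayList<>();
        for (Map<String, String> row : personMap.values()) {
            out.add(String.join("|", row.values()));
        }
        Files.write(file, out);                 // rewrite the whole file (header is not restored)
    }

    public static void main(String[] args) throws IOException {
        dataUpdate(Paths.get("people.txt"), 55, "Name", "Bob Ross");
    }
}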

Apache Pig process CSV with fields wrapped in quotes

How can I process a CSV file where some fields are wrapped in quotes?
Example line to process (field delimiter is ','):
I am column1, I am column2, "yes, I'm am column3"
The example has three columns. But the following example will say that I have four columns:
A = load '/path/to/file' using PigStorage(',');
Any suggestions or links to a resource, please?
Try loading the data, then use a FOREACH ... GENERATE to regenerate the data into whatever format you need. For the fields where you need to strip the quotes, use REPLACE (it takes three arguments: the field, a regex, and the replacement).
data = LOAD 'testdata' USING PigStorage(',');
data = FOREACH data GENERATE
    (chararray) $0 AS col1,
    (chararray) $1 AS col2,
    REPLACE((chararray) $3, '"', '') AS col3;
Note that with the sample line PigStorage(',') actually splits the quoted field across $2 and $3, so you may need to CONCAT them before the REPLACE.

Hibernate multiple parameters setString generated Java

I'm using hibernate, and trying to do a LIKE on certain fields.
I'm splitting a string, and then generating the HQL, with
table.entry LIKE :argsearch_0 OR table.entry LIKE :argsearch_0 OR
table.entry LIKE :argsearch_1 OR table.entry LIKE :argsearch_1
(0 and 1 are in fact incremented by a counter).
But I get:
Not all named parameters have been set: [argsearch_0]
First question:
Can I use the same named parameter twice and make only one setParameter (or setString) call?
String nameParam = "argsearch_" + i;
q.setParameter(nameParam, "%" + args[i] + "%");
Second question:
Why are my parameters not working?
It depends what you mean when you ask "Can I use 2 named parameters and do only 1 setParameter".
In your original query you have 2 named parameters ('argsearch_0' and 'argsearch_1') and each is used twice in the query. So you have to call set for both 'argsearch_0' and 'argsearch_1', but only once for each (actually you can call set multiple times for the same parameter if you really want, but only the last call is used).
As for your second question, as someone already pointed out, it's because you have a bug in your code: you are not setting the value for the 'argsearch_0' parameter.
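To illustrate the first point, a named parameter can appear any number of times in the HQL string and still needs only one setParameter call per distinct name. A minimal sketch (the Entry entity and its fields are made up; session is an open Hibernate Session):
// :argsearch_0 is used twice in the HQL but set only once
org.hibernate.Query q = session.createQuery(
    "from Entry e where e.text like :argsearch_0 or e.title like :argsearch_0");
q.setParameter("argsearch_0", "%" + args[0] + "%"); // one call binds both occurrences
java.util.List results = q.list();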
You can try this:
Step 1: Add however many parameters you need to a HashMap.
Map<String, Object> paramList = new HashMap<String, Object>();
paramList.put("contactNo", 22);
Step 2: Write your query.
Query query1 = session.createQuery("select c from EmailTemplate c where c.contactNo = :contactNo");
Step 3: Bind each parameter from the map; the data type does not matter.
for (String paramKey : paramList.keySet()) {
    query1.setParameter(paramKey, paramList.get(paramKey));
}
Step 4: Get the output.
String finalResult = query1.getSingleResult().toString();

How to store grouped records into multiple files with Pig?

After loading and grouping records, how can I store those grouped records into several files, one per group (=userid)?
records = LOAD 'input' AS (userid:int, ...);
grouped_records = GROUP records BY userid;
I'm using Apache Pig version 0.8.1-cdh3u3 (rexported)
Indeed, there is a MultiStorage class in Piggybank which does exactly what I want: it splits the records by a specified attribute (at index '0' in my example). Remember to REGISTER the piggybank jar first.
STORE records INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0', 'none', ',');
From the MultiStorage documentation:
A = LOAD 'mydata' USING PigStorage() as (a, b, c);
STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0', 'bz2', '\\t');
Parameters:
parentPathStr - Parent output dir path
splitFieldIndex - key field index
compression - 'bz2', 'bz', 'gz' or 'none'
fieldDel - Output record field delimiter.
Reference: GrepCode

retrieving the values from the nested hashmap

I have an XML file with many copies of a table node structure as below:
<databasetable TblID="123" TblName="Department1_mailbox">
<SelectColumns>
<Slno>dept1_slno</Slno>
<To>dept1_to</To>
<From>dept1_from</From>
<Subject>dept1_sub</Subject>
<Body>dept1_body</Body>
<BCC>dept1_BCC</BCC>
<CC>dept1_CC</CC>
</SelectColumns>
<WhereCondition>MailSentStatus='New'</WhereCondition>
<UpdateSuccess>
<MailSentStatus>'Yes'</MailSentStatus>
<MailSentFailedReason>'Mail Sent Successfully'</MailSentFailedReason>
</UpdateSuccess>
<UpdateFailure>
<MailSentStatus>'No'</MailSentStatus>
<MailSentFailedReason>'Mail Sending Failed'</MailSentFailedReason>
</UpdateFailure>
</databasetable>
Since it is not efficient to traverse the file every time the program needs the details of a node, I used nested hashmaps to store the details while traversing the XML file once. The structure I used is as below:
MapMaster
  Key: 123 -> Value: MapDetails
    Key: TblName        -> Department1_mailbox
    Key: SelectColumns  -> mapSelect
      Slno    -> dept1_slno
      To      -> dept1_to
      From    -> dept1_from
      Subject -> dept1_sub
      Body    -> dept1_body
      BCC     -> dept1_BCC
      CC      -> dept1_CC
    Key: WhereCondition -> MailSentStatus='New'
    Key: UpdateSuccess  -> mapUS
      MailSentStatus       -> 'Yes'
      MailSentFailedReason -> 'Mail Sent Successfully'
    Key: UpdateFailure  -> mapUF
      MailSentStatus       -> 'No'
      MailSentFailedReason -> 'Mail Sending Failed'
But the problem I'm facing now is retrieving a value via the nested keys. For example, if I need the value of the Slno key, I have to chain the lookups with casts, specifying TblID, SelectColumns, and Slno:
String slno = (String) ((Map) ((Map) mapMaster.get("123")).get("SelectColumns")).get("Slno");
This is inconvenient to use in a program. Please suggest a solution, but note that iterating over the maps is not an option: I have to fetch individual values as my program needs them.
EDIT: my program has to fetch the IDs of the departments which have the privilege to send mails; these IDs are then compared with the IDs in the XML file, and only the information for the IDs that match is fetched from the XML. That is all my program does. Please help.
Thanks in advance,
Vishu
Never cast to a specific Map implementation. Better to cast to the Map interface, i.e.
((Map) one.get("foo")).get("bar")
Better still, do not use casting at all. You can define the collection using generics, so the compiler does the work for you:
Map<String, Map<String, Integer>> one = new HashMap<String, Map<String, Integer>>();
Map<String, Integer> two = new HashMap<String, Integer>();
one.put("foo", two);
Now you can say:
int n = one.get("foo").get("bar");
No casting, no problems.
But the better solution is not to use nested tables at all. Create your own classes like SelectColumns, WhereCondition etc. Each class should have appropriate private fields, getters, and setters. Now parse your XML creating instances of these classes, and then use the getters to traverse the data structure.
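For example, a minimal sketch of one such class (field names follow the XML above):

public class SelectColumns {
    private String slno;
    private String to;
    private String from;
    private String subject;
    private String body;

    public String getSlno() { return slno; }
    public void setSlno(String slno) { this.slno = slno; }
    public String getTo() { return to; }
    public void setTo(String to) { this.to = to; }
    // ... getters and setters for the remaining fields
}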
BTW, if you wish to use JAXB you hardly have to do anything yourself. Something like the following:
Unmarshaller u = JAXBContext.newInstance(SelectColumns.class, WhereCondition.class).createUnmarshaller();
SelectColumns[] columns = (SelectColumns[])u.unmarshal(in);
One approach would be to generate fully qualified keys that contain the XML path to each element or attribute. These keys would be unique, stored in a single hashmap, and would get you to the element quickly.
Your code would simply have to generate a unique textual representation of the path, and store and retrieve the XML element based on that key.
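A minimal sketch of that idea (the path syntax is made up):

import java.util.HashMap;
import java.util.Map;

public class FlatKeys {
    public static void main(String[] args) {
        Map<String, String> flat = new HashMap<String, String>();
        // keys are built while parsing: TblID + "/" + element path
        flat.put("123/SelectColumns/Slno", "dept1_slno");
        flat.put("123/WhereCondition", "MailSentStatus='New'");
        System.out.println(flat.get("123/SelectColumns/Slno")); // one lookup, no nested casts
    }
}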
