Apache Pig process CSV with fields wrapped in quotes - java

How can I process a CSV file where some fields are wrapped in quotes?
An example line to process (the field delimiter is ','):
I am column1, I am column2, "yes, I'm am column3"
The example has three columns, but the following load statement will say that I have four:
A = load '/path/to/file' using PigStorage(',');
Any suggestions or a link to a resource?

Try loading the data, then use a FOREACH ... GENERATE to project it into whatever shape you need. For the fields where you need to strip the quotes, use REPLACE, which takes the input field, a pattern, and a replacement:
data = LOAD 'testdata' USING PigStorage(',');
data = FOREACH data GENERATE
    (chararray) $0 AS col1:chararray,
    (chararray) $1 AS col2:chararray,
    REPLACE((chararray) $3, '"', '') AS col3:chararray;
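Note that PigStorage(',') will still have split the quoted field into $2 and $3, so you may need to CONCAT the pieces back together. If the piggybank contrib jar is available in your Pig version, a more robust option is CSVExcelStorage, which understands quoted delimiters; a minimal sketch (the jar path is an assumption):
REGISTER /path/to/piggybank.jar;
A = LOAD '/path/to/file' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
-- A now has three fields, and the comma inside the quoted field is preserved.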

Related

I want to convert a report in text format into an xlsx document, but the data in the text file has some missing column values.

Typical report data looks like this:
A simple approach I wanted to follow was to use whitespace as a delimiter, but the data is not well structured.
Read the first line of the file and split it into column names wherever there is more than one whitespace character, and record the offset at which each column starts. After that you can simply go through the remaining rows of data and extract the values by slicing each line at the recorded column offsets.
(And please don't put images of text into Stack Overflow; actual text is better.)
EDIT:
Python implementation:
import pandas as pd
import re

file = "path/to/file.txt"
with open(file, "r") as f:
    # Read the header line and split it into column names on runs of spaces.
    line = f.readline()
    columns = re.split(" +", line)
    # Record the character offset at which each column name starts;
    # the data rows are sliced at these offsets below.
    column_sizes = [next(re.finditer(re.escape(column), line)).start() for column in columns]
    column_sizes.append(-1)
    # ------
    f.readline()  # skip the separator line beneath the header
    rows = []
    while True:
        line = f.readline()
        if len(line) == 0:  # end of file
            break
        elif line[-1] != "\n":
            line += "\n"  # so the final slice (ending at -1) always drops one char
        # Slice the row at the recorded column offsets.
        row = []
        for i in range(len(column_sizes) - 1):
            value = line[column_sizes[i]:column_sizes[i + 1]]
            row.append(value)
        rows.append(row)

columns = [column.strip() for column in columns]
df = pd.DataFrame(data=rows, columns=columns)
print(df)
df.to_excel(file.split(".")[0] + ".xlsx")
You are correct that exporting from text to csv is not a practical start; however, it would be good for import. So here is your 100% well structured source text, saved as plain text.
And here is the import to Excel.
You can use Google Lens to get your data out of this picture and then copy and paste it into an Excel file; that is the easiest way.
Alternatively, first convert the picture into a PDF and then use Google Lens: go to File, scroll to the Print option, and in the print settings there is a 'Microsoft Print to PDF' option. Select it, press Print, and when it asks for a location, choose one and use the resulting file.

How to remove duplicate columns after a JOIN in Pig?

Let's say I JOIN two relations like:
-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9
-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3
jnd = JOIN part BY $0, original BY $0;
The output will be:
1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3
Notice that $0 is shown twice in each tuple, e.g.:
1,5.3,1,Anju,3.6,IT,A,1.6,0.3
^     ^
|-----|
I can remove the duplicate key manually by doing:
jnd = foreach jnd generate $0,$1,$3,$4 ..;
Is there a way to remove the duplicate key dynamically, without listing every column by hand?
I have faced the same kind of issue while joining data sets and in other data processing work where column names get repeated in the output.
So I worked on a UDF which removes the duplicate columns by comparing the schema names of the fields, retaining the data of the first occurrence of each unique column.
Prerequisite:
All fields must have names in the schema.
You need to download the UDF file and build it into a jar in order to use it.
UDF file location on GitHub:
GitHub UDF Java File Location
We will take the above question as an example.
--Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9
--Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3
PIG Script:
REGISTER /home/user/
DSA = LOAD '/home/user/DSALOC' AS (ROLLNO:int,CGPA:float);
DSB = LOAD '/home/user/DSBLOC' AS (ROLLNO:int,NAME:chararray,SUB1:float,BRANCH:chararray,GRADE:chararray,SUB2:float);
JOINOP = JOIN DSA BY ROLLNO,DSB BY ROLLNO;
After the join we will get the following schema:
DSA::ROLLNO:int,DSA::CGPA:float,DSB::ROLLNO:int,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float
To reduce it to:
DSA::ROLLNO:int,DSA::CGPA:float,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float
that is, with DSB::ROLLNO:int removed, we use the UDF as follows:
JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));
where org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.
The UDF removes duplicate columns by comparing the field names in the schema, so DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the data set.
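For comparison, the same projection can be written by hand using Pig's disambiguated column names (a minimal sketch; the names follow from the join schema shown above):
JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE DSA::ROLLNO, DSA::CGPA, DSB::NAME, DSB::SUB1, DSB::BRANCH, DSB::GRADE, DSB::SUB2;
The UDF simply automates this projection for joins with many columns.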

Extracting a column from a paragraph from a csv file using java

MAJOR ACC NO,MINOR ACC NO,STD CODE,TEL NO,DIST CODE
7452145,723456, 01,4213036,AAA
7254287,7863265, 01,2121920,AAA
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FOOCTYP,FOOCDES,FOOCAMT,FSTD,FTELNO,FNORECON,FXFRACCN,FLANGIND,CUR
12345,71234,7643234,AAA,001,DX,WLR Promotion - Insitu /Pre-Cabled PSTN Connection,-37.87,,,0,,E,EUR
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FORDNO,FREF,FCHGDES,FCHGAMT,CUR,FORENFRM,FORENTO
3242241,72349489,2345352,AAA,001,30234843P ,1,NEW CONNECTION - PRECABLED CHARGE,37.87,EUR,2123422,201201234
12123471,7618412389,76333232,AAA,001,3123443P ,2,BROKEN PERIOD RENTAL,5.40,EUR,201234523,20123601
I have a csv file something like the one above and I want to extract certain columns from it. For example, I want to extract the first column of the first paragraph. I'm kind of new to Java; I am able to read the file, but I don't know how to extract particular columns from the different paragraphs. Any help will be appreciated.
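A minimal sketch of one approach, assuming the layout shown above (the file name report.csv and the header prefixes are assumptions): read the file line by line, use the header rows to track which "paragraph" you are in, and split the data rows on commas.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FirstColumnExtractor {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("report.csv"));
        List<String> firstColumn = new ArrayList<>();
        boolean inFirstSection = false;
        for (String line : lines) {
            if (line.startsWith("MAJOR ACC NO")) {
                // header row of the first paragraph
                inFirstSection = true;
                continue;
            }
            if (inFirstSection) {
                if (line.startsWith("FRUNDTE")) {
                    break; // the next paragraph's header ends the first one
                }
                if (!line.trim().isEmpty()) {
                    firstColumn.add(line.split(",")[0]); // first column of this row
                }
            }
        }
        System.out.println(firstColumn); // [7452145, 7254287] for the sample above
    }
}
The same pattern extends to the other paragraphs by matching their header rows and picking a different index from the split.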

How to store grouped records into multiple files with Pig?

After loading and grouping records, how can I store those grouped records into several files, one per group (=userid)?
records = LOAD 'input' AS (userid:int, ...);
grouped_records = GROUP records BY userid;
I'm using Apache Pig version 0.8.1-cdh3u3 (rexported)
Indeed, there is a MultiStorage class in the piggybank which does exactly what I want: it splits the records by a specified attribute (at index '0' in my example):
STORE records INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0', 'none', ',');
The usage example and parameters from the class documentation:
A = LOAD 'mydata' USING PigStorage() as (a, b, c);
STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0', 'bz2', '\\t');
Parameters:
parentPathStr - Parent output dir path
splitFieldIndex - key field index
compression - 'bz2', 'bz', 'gz' or 'none'
fieldDel - Output record field delimiter.
Reference: GrepCode
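Note that MultiStorage ships in the piggybank contrib jar, so it must be registered before the STORE; a minimal sketch (the jar path is an assumption):
REGISTER /path/to/piggybank.jar;
STORE records INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0', 'none', ',');
This writes the tuples under one output subdirectory per distinct value of field 0 (userid); no GROUP BY is needed first.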

Java XMLInputFactory - truncates text when reading data with .getData()

I'm using XMLInputFactory to read data (SQL queries) from an XML file.
In some cases the data is truncated. For example:
select CASE WHEN count(*) > 0 THEN 'LX1VQMSSRV069 OK' ELSE 'LX1VQMSSRV069 NOK' END from [PIWSLog].[dbo].[log]
is read as follows (the text is truncated after the last '.'):
select CASE WHEN count(*) > 0 THEN 'LX1VQMSSRV069 OK' ELSE 'LX1VQMSSRV069 NOK' END from [PIWSLog].[dbo]
I've tested with several strings and it seems that the problem is related to the bracketed names ([...].[...].[...]).
I'm reading the data using:
mySQLquery = event.asCharacters().getData();
Another situation is when the string contains '\n' characters. If it has two '\n's, event.asCharacters().getData() reads it correctly, but if it has three '\n's the string is truncated after the second '\n'. This is very odd!
Any idea what's the problem and how can I solve it?
The XMLInputFactory API is not obliged to give you all of the characters of a text node in one go. It is permitted to pass you a sequence of character events, each containing a fragment of the string.
You'll probably find that if you read further events after the one containing the truncated string, the remainder of your string turns up (possibly spread over several events).
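A minimal sketch of the accumulating pattern (the file name queries.xml and the per-element handling are assumptions); setting the standard IS_COALESCING property additionally asks the parser to merge adjacent character events for you:
import java.io.FileInputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class QueryReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to coalesce adjacent character data where it can.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(new FileInputStream("queries.xml"));

        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isCharacters()) {
                // Accumulate every fragment; the spec allows the text of one
                // element to be delivered as several character events.
                text.append(event.asCharacters().getData());
            } else if (event.isEndElement()) {
                String mySQLquery = text.toString().trim();
                if (!mySQLquery.isEmpty()) {
                    System.out.println(mySQLquery);
                }
                text.setLength(0); // reset for the next element
            }
        }
        reader.close();
    }
}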
