I need to output my Hadoop result in .csv format.
How do I do this?
My code: https://github.com/studhadoop/xml/blob/master/XmlParser11.java
Should I simply include csvoutputFormat in my code?
I am using the mapreduce API.
myjob.sh
bin/hadoop jar /var/root/ALA/ala_jar/clsperformance.jar ala.clsperf.ClsPerf /user/root/ala_xmlrpt/Amrita\ Vidyalayam\,\ Karwar_Class\ 1\ B_ENG.xml /user/root/ala_xmlrpt-outputshell4
bin/hadoop fs -get /user/root/ala_xmlrpt-outputshell4/part-r-00000 /Users/jobsubmit
cat /Users/jobsubmit/part-r-00000 /Users/jobsubmit/output.csv
SOLUTION
Yes, I was missing the > redirect in the cat command:
cat /Users/jobsubmit/part-r-00000 > /Users/jobsubmit/output.csv
You can use TextOutputFormat. The default key/value separator is a tab character. You can change the separator by setting the property "mapreduce.output.textoutputformat.separator" in your driver (on older releases the key is "mapred.textoutputformat.separator").
conf.set("mapreduce.output.textoutputformat.separator", ",");
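Changing the separator only swaps the character between key and value; it does not quote fields that already contain a comma (the input file name in the question does). As a rough sketch of an alternative, the tab-separated part file could be post-processed into CSV. The class and method names below are hypothetical, and the quoting follows the common RFC 4180 convention rather than anything Hadoop provides.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: converts one tab-separated output line into a CSV line. */
final class TsvToCsv {

    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    static String quote(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Splits on tabs (keeping empty trailing fields) and joins the fields with commas. */
    static String toCsvLine(String tsvLine) {
        List<String> out = new ArrayList<>();
        for (String field : tsvLine.split("\t", -1)) {
            out.add(quote(field));
        }
        return String.join(",", out);
    }

    public static void main(String[] args) {
        // prints "Amrita Vidyalayam, Karwar",42
        System.out.println(toCsvLine("Amrita Vidyalayam, Karwar\t42"));
    }
}
```

Each line of part-r-00000 could be passed through toCsvLine before being written to output.csv.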
Related
I have a Spark Dataset<Row> with a lot of columns that has to be written to a text file with a tab delimiter. With CSV it is easy to specify that option, but how do I handle this for a text file when using Java?
Option 1 :
yourDf
.coalesce(1) // if you want to save as single file
.write
.option("sep", "\t")
.option("encoding", "UTF-8")
.csv("outputpath")
This is the same as writing CSV, except that you use a tab as the delimiter.
Yes, the output is CSV, as you mentioned in the comment. If you want to rename the file, you can do the following:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
FileSystem fs = FileSystem.get(spark.sparkContext.hadoopConfiguration);
fs.rename(new Path("outputpath"), new Path("outputpath.txt"));
Note :
1) You can use fs.globStatus if you have multiple files under your output path. In this case coalesce(1) produces a single CSV file, so it is not needed.
2) If you are using S3 instead of HDFS, you may need to set the following before attempting to rename:
spark.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
Option 2 :
Another option (if you don't want to use the csv API) could be:
yourDf.rdd
.coalesce(1)
.map(x => x.mkString("\t"))
.saveAsTextFile("yourfile.txt")
I have an R script to which I want to pass parameters from Java code. The parameters are a CSV file name and a unique ID, which is used to name the output file.
My R script is:
df1 <- read.csv("filename.csv")
vs=colnames(df1)
md=formula(paste(vs[3],"~",vs[1],"+",vs[2]))
fit <- summary(aov(md, data=df1))[[1]]
#text output
names(fit)[1:4]=c("DF","SS","MS","F")
sink("test.txt")
In this code the first line, df1 <- read.csv("filename.csv"), should take the file name dynamically from the Java code, and the last line, sink("test.txt"), should use the unique ID to name the output file.
The Java code is:
buildCommand.add("Rscript ");
buildCommand.add(scriptName);
buildCommand.add(inputFileWithPathExtension);
buildCommand.add(uniqueIdForR);
I have seen other posts but I am unsure whether they help in my case. Similar posts also talk about the rJava package, but I didn't get a clear idea from them.
Any help will be highly appreciated. Thanks in advance!
Here is a very simple example of reading command line arguments for your case:
args <- commandArgs(TRUE)
input <- args[1]
output <- paste0(args[2], ".txt")
cat("Reading from", input, "\n")
cat("Writing to", output, "\n")
Example:
$ Rscript foo.R foo.csv 1234567
Reading from foo.csv
Writing to 1234567.txt
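On the Java side, the same call can be sketched with ProcessBuilder. This is only an illustration under a few assumptions: Rscript is on the PATH, and the class name, method names, and the foo.R / foo.csv arguments are placeholders. Each argument goes in as its own list element, so "Rscript" must not carry a trailing space.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical launcher that passes a CSV path and a unique ID to an R script. */
final class RscriptLauncher {

    /** Builds the argument list; every element is passed to the OS verbatim. */
    static List<String> buildCommand(String scriptName, String inputCsv, String uniqueId) {
        List<String> command = new ArrayList<>();
        command.add("Rscript");   // no trailing space: this is the executable name
        command.add(scriptName);
        command.add(inputCsv);    // read in R as args[1]
        command.add(uniqueId);    // read in R as args[2]
        return command;
    }

    /** Runs the command and returns its exit code; requires Rscript to be installed. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()      // forward the script's stdout/stderr to this process
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) to actually execute it.
        System.out.println(String.join(" ", buildCommand("foo.R", "foo.csv", "1234567")));
    }
}
```

In the R script, read.csv(input) and sink(output) would then use the values taken from commandArgs.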
This is my first post and I'm new to Java. I'm working on a file parser. I've tried to identify whether the input is CSV or another file format, but it does not look like a standard format. I'm working on an Apache Camel solution (my first and last idea :( ), but maybe some of you recognize this kind of file format? Additionally, I've got an .imp file as my output.
Here is my example input:
NrDok:FS-2222/17/W
Data:12.02.2017
SposobPlatn:GOT
NazwaWystawcy:MAAKAI Gawron
AdresWystawcy:33-123 bABA
KodWystawcy:33-112
MiastoWystawcy:bABA
UlicaWystawcy:czysfa 8
NIPWystawcy:123-19-85-123
NazwaOdbiorcy:abc abc-HANDLOWO-USŁUGOWE
AdresOdbiorcy:33-123 fghd
KodOdbiorcy:33-123
MiastoOdbiorcy:Tdsfs
UlicaOdbiorcy:dfdfdA 39
NIPOdbiorcy:82334349
TelefonOdbiorcy:654-522-124
NrOdbiorcyWSieciSklepow:efdsS-sffgsA
IloscLinii:1
Linia:Nazwa{ĆWIARTKA KG}Kod{C1}Vat{5}Jm{kg.}Asortyment{dfgv}Sww{}PKWIU{10.12.10}Ilosc{3.40}Cena{n3.21}Wartosc{n11.83}IleWOpak{1}CenaSp{b0.00}
DoZaplaty:252.32
And here is my example output file:
FH 2015.07.31 2015.07.31 F04443 Gotowka
FO 812-123-45-11 P.a.b.Uc"fdad" abcd deffF UL.fdfgdfdA 12/33 33-123 afvdf
FS 779-19-06-082 badfdf S.A. ul. Wisniowa 89 60-003 Poznan
FP 00218746 CHRZAN TARTY EXTRA POLONAISE 180G SZT 32.00 2.21 8 10.39.17.0 32.00 5900138000055
Is there an easy way to convert the first file to the second file's format? Maybe you know the type of this file? In the meantime, I'm continuing my work with Apache Camel.
Thanks in advance for your time and help!
I suggest you try https://tika.apache.org/1.1/detection.html#Mime_Magic_Detection
It's a very good library for file type recognition.
There is a simple example at https://www.tutorialspoint.com/tika/tika_document_type_detection.htm
Your file can be read as a standard Java .properties file. This format allows both = and : as key/value separators. However, the file contains non ISO-8859-1 characters such as the Polish Ć, which Properties.load(InputStream) will not decode correctly; loading through a Reader opened with the file's actual charset avoids that.
This line
Nazwa{ĆWIARTKA KG}Kod{C1}Vat{5}Jm{kg.}Asortyment{dfgv}Sww{}PKWIU{10.12.10}Ilosc{3.40}Cena{n3.21}Wartosc{n11.83}IleWOpak{1}CenaSp{b0.00}
seems to be a custom serialization format of an object, in the form
key1{value1}key2{value2}...
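A minimal sketch of parsing that key{value} form with a regular expression follows. It assumes values never contain a closing brace themselves, which the sample does not contradict but also does not guarantee; the class name is hypothetical, and the "Linia:" prefix is expected to be stripped before calling it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for records shaped like key1{value1}key2{value2}... */
final class LineRecordParser {

    // A key is a run of non-brace characters; a value is everything up to the next '}'.
    private static final Pattern PAIR = Pattern.compile("([^{}]+)\\{([^}]*)\\}");

    /** Returns the pairs in their original order; empty values such as Sww{} are kept. */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(record);
        while (m.find()) {
            fields.put(m.group(1).trim(), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints {Kod=C1, Vat=5, Sww=, Ilosc=3.40}
        System.out.println(parse("Kod{C1}Vat{5}Sww{}Ilosc{3.40}"));
    }
}
```

Values such as n3.21 or b0.00 come back as raw strings; what the n and b prefixes mean is not stated in the question, so the sketch does not interpret them.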
Your output file contains a lot of data that is not present in the input, which makes me think the output is built by querying external systems for additional data. You will have to investigate that yourself; nobody can infer the transformation from the provided input alone.
I want to extract table data from a PDF, and I am using the command below to get it:
java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -t example.pdf
But with this command, the data of two columns gets mixed together in some rows,
so I want to specify column coordinates to get clean data.
I don't know how to find the column coordinates,
so it would be helpful if anyone could guide me to the right command.
Thanks in advance!
You can specify the column coordinates using the -c or --columns parameter. The coordinates you specify are the x coordinates of the boundaries between columns. So if one column goes from 10.5 to 13.5 and the next column goes from 13.5 to 17.5, then you only list 13.5. Also make sure guessing is off: per the help text below, -g/--guess is a plain switch that takes no value, so simply leave it out. You didn't provide an example PDF, so I can't give you the correct coordinates, but your command would look something like this:
java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -c 15.7,17.3,19.2,33.2,70.1,100.7,200.6,300.7 -t example.pdf
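To make the boundary rule concrete, here is a small sketch that derives the -c value from a list of column x-ranges. The class, the method, and the sample ranges are hypothetical and not part of tabula; the ranges are assumed to be sorted left to right and to touch each other.

```java
/** Hypothetical helper that turns column x-ranges into a --columns argument. */
final class TabulaColumns {

    /**
     * Each entry is {left, right}. Only the interior boundaries are listed,
     * i.e. the right edge of every column except the last one.
     */
    static String columnsArgument(double[][] columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length - 1; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(columns[i][1]); // boundary between column i and column i + 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[][] columns = { {10.5, 13.5}, {13.5, 17.5} };
        // prints 13.5
        System.out.println(columnsArgument(columns));
    }
}
```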
You can read more about the different options for getting your command just right from the help command:
$ java -jar target/tabula-1.0.1-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
<FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
[-s <PASSWORD>] [-t] [-u] [-v]
Tabula helps you extract tables from PDFs
-a,--area <AREA> Portion of the page to analyze
(top,left,bottom,right). Example: --area
269.875,12.75,790.5,561. Default is entire
page
-b,--batch <DIRECTORY> Convert all .pdfs in the provided directory.
-c,--columns <COLUMNS> X coordinates of column boundaries. Example
--columns 10.1,20.2,30.3
-d,--debug Print detected table areas instead of
processing.
-f,--format <FORMAT> Output format: (CSV,TSV,JSON). Default: CSV
-g,--guess Guess the portion of the page to analyze per
page.
-h,--help Print this help text.
-i,--silent Suppress all stderr output.
-l,--lattice Force PDF to be extracted using lattice-mode
extraction (if there are ruling lines
separating each cell, as in a PDF of an Excel
spreadsheet)
-n,--no-spreadsheet [Deprecated in favor of -t/--stream] Force PDF
not to be extracted using spreadsheet-style
extraction (if there are no ruling lines
separating each cell)
-o,--outfile <OUTFILE> Write output to <file> instead of STDOUT.
Default: -
-p,--pages <PAGES> Comma separated list of ranges, or all.
Examples: --pages 1-3,5-7, --pages 3 or
--pages all. Default is --pages 1
-r,--spreadsheet [Deprecated in favor of -l/--lattice] Force
PDF to be extracted using spreadsheet-style
extraction (if there are ruling lines
separating each cell, as in a PDF of an Excel
spreadsheet)
-s,--password <PASSWORD> Password to decrypt document. Default is empty
-t,--stream Force PDF to be extracted using stream-mode
extraction (if there are no ruling lines
separating each cell)
-u,--use-line-returns Use embedded line returns in cells. (Only in
spreadsheet mode.)
-v,--version Print version and exit.
// Java code to export as CSV using FileWriter.
Java code:
FileWriter fileWriterForCsv;
fileWriterForCsv.append(4.500);
fileWriterForCsv.append(",");
fileWriterForCsv.append("Mani");
fileWriterForCsv.append(",");
fileWriterForCsv.append("March");
// Content of the CSV File Exported mentioned below
Generated CSV
______________
Actual Result : 4.5,Mani,March
Expected Result : 4.500,Mani,March
Please let me know whether I need to change the Java code, or how else to get the expected result shown above.
I also tried changing the column type to text in the CSV template, but that did not give the expected result.
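Try formatting the number output. FileWriter has no format method of its own, so build the line with String.format and append it. This writes the whole line with the platform line separator at the end.
fileWriterForCsv.append(String.format("%.3f,%s,%s%n", 4.5, "Mani", "March"));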
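A self-contained sketch of the formatting approach is below. FileWriter only offers write and append, so the row is built with String.format first. The class name and the out.csv path are placeholders, and Locale.ROOT is an added assumption that keeps '.' as the decimal separator regardless of the JVM's default locale.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.Locale;

/** Hypothetical example: writes a CSV row while keeping trailing zeros on the number. */
final class CsvNumberFormat {

    /** Formats one row; %.3f forces exactly three decimal places. */
    static String row(double amount, String name, String month) {
        // %n expands to the platform line separator (CRLF on Windows, LF elsewhere).
        return String.format(Locale.ROOT, "%.3f,%s,%s%n", amount, name, month);
    }

    public static void main(String[] args) throws IOException {
        // try-with-resources closes (and flushes) the writer automatically.
        try (FileWriter writer = new FileWriter("out.csv")) {
            writer.append(row(4.5, "Mani", "March")); // writes 4.500,Mani,March
        }
    }
}
```

Note that a spreadsheet program may still display the cell as 4.5 when it opens the file; the text in the CSV itself is 4.500.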