Spark Save as Text File grouped by Key - java

I would like to save an RDD to text files grouped by key. Currently I can't figure out how to split the output into multiple files: all the output for keys that share the same partition gets written to the same file, but I would like a different file for each key. Here's my code snippet:
JavaPairRDD<String, Iterable<Customer>> groupedResults = customerCityPairRDD.groupByKey();
groupedResults.flatMap(x -> x._2().iterator())
.saveAsTextFile(outputPath + "/cityCounts");

This can be achieved by using foreachPartition to save each partition to a separate file.
You can develop your code as follows:
groupedResults.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<Customer>>>>() {
    @Override
    public void call(Iterator<Tuple2<String, Iterable<Customer>>> rec) throws Exception {
        FSDataOutputStream fsOutputStream = null;
        BufferedWriter writer = null;
        try {
            // In practice, use a distinct output path per partition (or per key).
            fsOutputStream = FileSystem.get(new Configuration()).create(new Path("path1"));
            writer = new BufferedWriter(new OutputStreamWriter(fsOutputStream));
            while (rec.hasNext()) {
                Tuple2<String, Iterable<Customer>> entry = rec.next();
                for (Customer cust : entry._2()) {
                    writer.write(cust.toString());
                    writer.newLine();
                }
            }
        } catch (Exception exp) {
            exp.printStackTrace();
            // Handle exception
        } finally {
            // Close the writer.
            if (writer != null) {
                writer.close();
            }
        }
    }
});
Hope this helps.
Ravi

So I figured out how to solve this: convert the RDD to a DataFrame and then partition by key during the write.
Dataset<Row> dataFrame = spark.createDataFrame(customerRDD, Customer.class);
dataFrame.write()
.partitionBy("city")
.text("cityCounts"); // write as text file at file path cityCounts

Related

How to write Parquet files in Java with Apache Arrow

I am trying to write data into Apache Parquet from Java. So far, what I've done is use Apache Arrow via the examples here: https://arrow.apache.org/cookbook/java/schema.html#creating-fields and create an Arrow-format dataset.
The question is, how do I write it into Parquet after that? Also, do I need to use Apache Arrow to output the data as a Parquet file, or can I use Apache Parquet directly to serialize the data and then output it as a Parquet file?
What I've done:
try (BufferAllocator allocator = new RootAllocator()) {
    Field name = new Field("name", FieldType.nullable(new ArrowType.Utf8()), null);
    Field age = new Field("age", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Schema schemaPerson = new Schema(asList(name, age));
    try (VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schemaPerson, allocator)) {
        VarCharVector nameVector = (VarCharVector) vectorSchemaRoot.getVector("name");
        nameVector.allocateNew(3);
        nameVector.set(0, "David".getBytes());
        nameVector.set(1, "Gladis".getBytes());
        nameVector.set(2, "Juan".getBytes());
        IntVector ageVector = (IntVector) vectorSchemaRoot.getVector("age");
        ageVector.allocateNew(3);
        ageVector.set(0, 10);
        ageVector.set(1, 20);
        ageVector.set(2, 30);
        vectorSchemaRoot.setRowCount(3);
        File file = new File("randon_access_to_file.arrow");
        try (FileOutputStream fileOutputStream = new FileOutputStream(file);
             ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null, fileOutputStream.getChannel())) {
            writer.start();
            writer.writeBatch();
            writer.end();
            System.out.println("Record batches written: " + writer.getRecordBlocks().size()
                    + ". Number of rows written: " + vectorSchemaRoot.getRowCount());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
But this outputs an Arrow file, not a Parquet file. Any ideas how I can output this as a Parquet file instead? And do I need Arrow to generate a Parquet file to begin with, or can I just use Parquet directly?
Arrow Java does not yet support writing to Parquet files, but you can use Parquet to do that.
There is some code in the Arrow dataset test classes that may help. See
org.apache.arrow.dataset.ParquetWriteSupport;
org.apache.arrow.dataset.file.TestFileSystemDataset;
The second class has some tests that use the utilities in the first one.
You can find them on GitHub here:
https://github.com/apache/arrow/tree/master/java/dataset/src/test/java/org/apache/arrow/dataset
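On the "can I just use Parquet directly?" part of the question: yes, if Arrow isn't otherwise needed, the parquet-avro module can write the same two-column data without going through Arrow at all. A rough sketch under that assumption (it requires org.apache.parquet:parquet-avro plus a Hadoop client on the classpath; class names and the output path are just examples):
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetDirectWriteExample {
    public static void main(String[] args) throws Exception {
        // Avro schema mirroring the Arrow schema in the question (name: string, age: int).
        Schema schema = SchemaBuilder.record("Person").fields()
                .requiredString("name")
                .requiredInt("age")
                .endRecord();

        String[] names = {"David", "Gladis", "Juan"};
        int[] ages = {10, 20, 30};

        // AvroParquetWriter writes GenericRecords straight to a Parquet file.
        try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("people.parquet"))
                             .withSchema(schema)
                             .build()) {
            for (int i = 0; i < names.length; i++) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("name", names[i]);
                record.put("age", ages[i]);
                writer.write(record);
            }
        }
    }
}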

How to export different tables from mysql database to different XML file using dbunit?

I am trying to export a database to XML files using DBUnit. I am facing a problem generating a separate XML file for each table; I have not been able to do this.
Can someone help me with this?
Following is the code:
QueryDataSet partialDataSet = new QueryDataSet(connection);
addTables(partialDataSet);
// XML file into which data needs to be extracted
FlatXmlDataSet.write(partialDataSet, new FileOutputStream("C:/Users/name/Desktop/test-dataset_temp.xml"));
System.out.println("Data set written");

static private void addTables(QueryDataSet dataSet) {
    if (tableList == null) return;
    for (Iterator k = tableList.iterator(); k.hasNext(); ) {
        String table = (String) k.next();
        try {
            dataSet.addTable(table);
        } catch (AmbiguousTableNameException e) {
            e.printStackTrace();
        }
    }
}
Now my problem is: how do I separate the tables so that I can generate a separate XML file for each table?
Thanks in advance.
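For what it's worth, one way to get a separate XML file per table (a sketch, untested, reusing the connection and tableList from the snippet above; the output directory is just an example) is to build a fresh QueryDataSet per table and write each one to its own file:
for (Iterator k = tableList.iterator(); k.hasNext(); ) {
    String table = (String) k.next();
    try {
        // One QueryDataSet (and therefore one output file) per table.
        QueryDataSet singleTableDataSet = new QueryDataSet(connection);
        singleTableDataSet.addTable(table);
        FlatXmlDataSet.write(singleTableDataSet,
                new FileOutputStream("C:/Users/name/Desktop/" + table + ".xml"));
    } catch (Exception e) {
        e.printStackTrace();
    }
}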

How to append data to a file using SuperCSV's CsvListWriter

I have a method to write data to a file.
public void writeCSFFileData(List<String> fileData) {
    try {
        CsvListWriter csvWriter = new CsvListWriter(new FileWriter("/path/file.csv"), CsvPreference.STANDARD_PREFERENCE);
        csvWriter.write(fileData);
        csvWriter.close();
    } catch (Exception e) {
        SimpleLogger.getInstance().writeError(e);
    }
}
The above method is called several times to write to the file, but each time the file is overwritten instead of appended.
Thanks in advance.
I found the solution myself: I just need to pass true to the FileWriter constructor to append the data.
Ex: new FileWriter("/path/file.csv", true)
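For completeness, the whole method with the append flag and try-with-resources (so the writer is always closed) might look roughly like this; the path is the same example path as above:
public void writeCSFFileData(List<String> fileData) {
    // true = append mode, so repeated calls add rows instead of overwriting the file.
    try (CsvListWriter csvWriter = new CsvListWriter(
            new FileWriter("/path/file.csv", true), CsvPreference.STANDARD_PREFERENCE)) {
        csvWriter.write(fileData);
    } catch (Exception e) {
        SimpleLogger.getInstance().writeError(e);
    }
}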

How to read files with an offset from Hadoop using Java

Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.
I don't want to use seek because I have read that it is expensive.
I have log files which I am using PIG to process down into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp to save wire time and bandwidth. (Let's say 5 - 10MB)
Currently I am using a BufferedReader to return small summary files, which is working fine:
ArrayList lines = new ArrayList();
...
for (FileStatus item : items) {
    // ignoring files like _SUCCESS
    if (item.getPath().getName().startsWith("_")) {
        continue;
    }
    in = fs.open(item.getPath());
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String line;
    line = br.readLine();
    while (line != null) {
        line = line.replaceAll("(\\r|\\n)", "");
        lines.add(line.split("\t"));
        line = br.readLine();
    }
}
I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.
Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.
Thanks!
As an added note, based on my research from the discussions below:
How does Hadoop process records split across block boundaries?
Hadoop FileSplit Reading
I think seek is the best option for reading files with huge volumes. It did not cause any problems for me, as the volume of data I was reading was in the range of 2-3 GB. I have not encountered any issues to date, but we did use file splitting to handle the large data set. Below is the code you can use for reading; test it for yourself.
public class HDFSClientTesting {

    /**
     * @param args
     */
    public static void main(String[] args) {
        SequenceFile.Reader rdr = null;
        try {
            //System.loadLibrary("libhadoop.so");
            Configuration conf = new Configuration();
            conf.addResource(new Path("core-site.xml"));
            FileSystem fs = FileSystem.get(conf);

            String filename = "/dir/00000027";
            long byteOffset = 3185041;

            rdr = new SequenceFile.Reader(fs, new Path(filename), conf);
            Text key = new Text();
            Text value = new Text();

            // Jump to the known byte offset, then read the next record.
            rdr.seek(byteOffset);
            rdr.next(key, value);

            // Plain text
            JSONObject jso = new JSONObject(value.toString());
            String content = jso.getString("body");
            System.out.println("\n\n\n" + content + "\n\n\n");

            File file = new File("test.gz");
            file.createNewFile();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the reader.
            IOUtils.closeStream(rdr);
        }
    }
}
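If the processed files are plain text rather than SequenceFiles (as in the question's BufferedReader loop), a minimal sketch of returning only a requested line range, e.g. lines 101-120, could look like the hypothetical helper below (it assumes Hadoop's FileSystem/Path and java.io imports; note it still streams the bytes up to the end of the range, whereas seek on a known byte offset avoids even that):
// Sketch: return lines [startLine, endLine] (1-based, inclusive) from a text file on HDFS.
List<String[]> readLineRange(FileSystem fs, Path path, int startLine, int endLine) throws IOException {
    List<String[]> lines = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
        String line;
        int lineNo = 0;
        while ((line = br.readLine()) != null) {
            lineNo++;
            if (lineNo < startLine) {
                continue; // skip lines before the requested range
            }
            if (lineNo > endLine) {
                break; // stop as soon as the range is covered
            }
            lines.add(line.replaceAll("(\\r|\\n)", "").split("\t"));
        }
    }
    return lines;
}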

HL7 parsing to get ORC-2

I am having trouble reading the ORC-2 field from an ORM^O01 order message. I am using HapiStructures-v23-1.2.jar to read it, but this method (getFillerOrderNumber()) is returning a null value:
MSH|^~\\&|recAPP|20010|BIBB|HCL|20110923192607||ORM^O01|11D900220|D|2.3|1\r
PID|1|11D900220|11D900220||TEST^FOURTYONE||19980808|M|||\r
ZRQ|1|11D900220||CHARTMAXX TESTING ACCOUNT 2|||||||||||||||||Y\r
ORC|NW|11D900220||||||||||66662^NOT INDICATED^X^^^^^^^^^^U|||||||||CHARTMAXX TESTING ACCOUNT 2|^695 S.BROADWAY^DENVER^CO^80209\r
OBR|1|11D900220||66^BHL, 9P21 GENOTYPE^L|NORMAL||20110920001800|||NOTAVAILABLE|N||Y|||66662^NOT INDICATED^X^^^^^^^^^^U\r
I want to parse this message, read the ORC-2 field, and save it in the database:
public static String getOrderNumber() {
    Message hapiMsg = null;
    Parser p = new GenericParser();
    p.setValidationContext(null);
    try {
        hapiMsg = p.parse(hl7Message);
    } catch (Exception e) {
        logger.error(e);
    }
    Terser terser = new Terser(hapiMsg);
    String fn = null;
    try {
        ORM_O01 getOrc = (ORM_O01) hapiMsg;
        ORC orc = new ORC(getOrc, null);
        fn = orc.getFillerOrderNumber().toString();
    } catch (Exception e) {
        logger.error(e);
    }
    return fn;
}
I read in some posts that I have to ladder through to reach the ORC, OBR, and NTE segments. Can someone help me with how to do this with a piece of code? Thanks in advance.
First I have to point out that ORC-2 is Placer Order Number and ORC-3 is Filler Order Number, not the other way round. So, what you might want to do is this:
ORM_O01 msg = ...
ORC orc = msg.getORDER().getORC();
String placerOrderNumber =
orc.getPlacerOrderNumber().getEntityIdentifier().getValue();
String fillerOrderNumber =
orc.getFillerOrderNumber().getEntityIdentifier().getValue();
I would suggest you read the Hapi documentation yourself: http://hl7api.sourceforge.net/v23/apidocs/index.html
Based on this code:
ORM_O01 getOrc = (ORM_O01)hapiMsg;
ORC orc = new ORC(getOrc, null);
String fn= orc.getFillerOrderNumber().toString();
It looks like you are creating a new ORC rather than pulling out the existing one from the message. I unfortunately can't provide the exact code as I'm only familiar with HL7, not HAPI.
EDIT: It looks like you may be able to do ORC orc = getOrc.getORDER().getORC();
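Putting the two answers together, a corrected version of the question's getOrderNumber() might look roughly like this (a sketch only; it assumes the hl7Message and logger fields from the original method, and that ORC-3, the filler order number, is what should be stored):
public static String getOrderNumber() {
    String fillerOrderNumber = null;
    try {
        Parser p = new GenericParser();
        p.setValidationContext(null);
        ORM_O01 msg = (ORM_O01) p.parse(hl7Message);
        // Pull the existing ORC out of the ORDER group instead of constructing a new one.
        ORC orc = msg.getORDER().getORC();
        fillerOrderNumber = orc.getFillerOrderNumber().getEntityIdentifier().getValue();
    } catch (Exception e) {
        logger.error(e);
    }
    return fillerOrderNumber;
}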
