I have multiple Avro files under a directory in a Hadoop environment, and I need to merge them all into a single Avro file.
Example:
/abc->
x.avro
y.avro } => a.avro
z.avro
The file a.avro should contain the contents of x, y, and z, which all share the same schema. I need to do this from a Java application. Any help is appreciated.
Thanks.
Apache Avro provides a few tools for working with Avro files (see the avro-tools sources on GitHub). These include a concat tool, which merges Avro files that share the same schema and non-reserved metadata; a cat tool, which extracts samples from an Avro data file; a tojson tool, which converts an Avro binary file to JSON; a recovery tool, which recovers data from a corrupt Avro data file; and more.
I have extracted the relevant code from those tools; here is a Java snippet that does what you need.
try {
    Path inPath = new Path("C:\\Users\\vaijnathp\\IdeaProjects\\MSExcel\\vaj");
    Path outPath = new Path("getDestinationPath");
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] contents = fs.listStatus(inPath, new OutputLogFilter());

    DataFileWriter<GenericRecord> writer = new DataFileWriter<>(new GenericDatumWriter<>());
    Schema schema = null;
    String inputCodec = null;
    Map<String, byte[]> metadata = new TreeMap<>();
    OutputStream output = new BufferedOutputStream(fs.create(outPath));

    for (FileStatus folderContent : contents) {
        if (folderContent.isFile() && folderContent.getPath().getName().endsWith(".avro")) {
            InputStream input = new BufferedInputStream(fs.open(folderContent.getPath()));
            DataFileStream<GenericRecord> reader =
                    new DataFileStream<>(input, new GenericDatumReader<GenericRecord>());
            if (schema == null) {
                // First file: take its schema, metadata and codec for the output file.
                schema = reader.getSchema();
                extractAvroFileMetadata(writer, metadata, reader);
                inputCodec = reader.getMetaString(DataFileConstants.CODEC);
                if (inputCodec == null) inputCodec = DataFileConstants.NULL_CODEC;
                writer.setCodec(CodecFactory.fromString(inputCodec));
                writer.create(schema, output);
            } else {
                // Subsequent files must match the first file's schema, metadata and codec.
                if (!schema.equals(reader.getSchema())) {
                    reader.close();
                    continue; // skip files with a different schema
                }
                compareAvroFileMetadata(metadata, reader, folderContent.getPath().getName());
                String thisCodec = reader.getMetaString(DataFileConstants.CODEC);
                if (thisCodec == null) thisCodec = DataFileConstants.NULL_CODEC;
                if (!inputCodec.equals(thisCodec)) {
                    reader.close();
                    continue; // skip files with a different codec
                }
            }
            writer.appendAllFrom(reader, false);
            reader.close();
        }
    }
    writer.close();
} catch (Exception e) {
    e.printStackTrace();
}
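The two helper methods referenced above, extractAvroFileMetadata and compareAvroFileMetadata, are not part of the snippet. Here is a minimal sketch of what they could look like, assuming they mirror how the Avro concat tool copies and compares non-reserved file metadata (keys not starting with "avro."); the bodies below are my own approximation, not code taken from the tool:
private static void extractAvroFileMetadata(DataFileWriter<GenericRecord> writer,
                                            Map<String, byte[]> metadata,
                                            DataFileStream<GenericRecord> reader) {
    for (String key : reader.getMetaKeys()) {
        if (!key.startsWith("avro.")) {       // skip reserved keys (schema, codec, ...)
            byte[] value = reader.getMeta(key);
            metadata.put(key, value);         // remember for later comparison
            writer.setMeta(key, value);       // must happen before writer.create(...)
        }
    }
}

private static void compareAvroFileMetadata(Map<String, byte[]> metadata,
                                            DataFileStream<GenericRecord> reader,
                                            String fileName) throws IOException {
    for (String key : reader.getMetaKeys()) {
        if (!key.startsWith("avro.")
                && !java.util.Arrays.equals(metadata.get(key), reader.getMeta(key))) {
            throw new IOException("Non-reserved metadata key '" + key + "' differs in " + fileName);
        }
    }
}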
I hope this code snippet helps you build your Java application. Thanks.
I've tried many different examples online and from Stack Overflow, but to no avail.
There is no passphrase, just either .gpg or .asc pub/sec files.
I'm using Kleopatra to export the .asc files from the .gpg files provided by the previous project maintainer, who left the company.
I'm getting the same error as in BouncyCastle Open PGP - unknown object in stream 47
File decryptFile(File file) {
    String newFileName = "/tmp/encrypted-" + file.getName().replace(".gpg", "")
    File newFile = new File(newFileName)
    File tempFile = getFileFromResource('seckeyascii.asc')
    // Just to see that files are read properly, it's fine
    logger.log(tempFile.getText())

    // Attempt 1
    new Decryptor(
            // new Key(getFileFromResource(RESOURCE_PUBRING)),
            // new Key(getFileFromResource(RESOURCE_SECRING))
            new Key(getFileFromResource('pubkeyascii.asc')),
            new Key(getFileFromResource('seckeyascii.asc'))
    ).decrypt(file, newFile)

    // Attempt 2
    File pubringFile = getFileFromResource(RESOURCE_PUBRING)
    File secringFile = getFileFromResource(RESOURCE_SECRING)
    KeyringConfig keyringConfig = KeyringConfigs.withKeyRingsFromFiles(pubringFile, secringFile, KeyringConfigCallbacks.withUnprotectedKeys())
    try {
        FileInputStream cipherTextStream = new FileInputStream(file.getPath())
        FileOutputStream fileOutput = new FileOutputStream(newFileName)
        BufferedOutputStream bufferedOut = new BufferedOutputStream(fileOutput)
        InputStream plaintextStream = BouncyGPG
                .decryptAndVerifyStream()
                .withConfig(keyringConfig)
                .andIgnoreSignatures()
                .fromEncryptedInputStream(cipherTextStream)
    }
    catch (Exception e) {
        logger.log("Error decrypting file: ${e.getMessage()}")
        return null
    }
    finally {
        Streams.pipeAll(plaintextStream, bufferedOut)
    }
    return newFile
}
What I get in all attempts is:
Iterator failed to get next object: unknown object in stream: 47
I tried converting the key files to ANSI or ASCII (Western Europe/OEM 850), but that didn't help.
This happens with different BouncyCastle-based libraries such as name.neuhalfen.projects.crypto.bouncycastle.openpgp and org.c02e.jpgpj.
I have scheduled an export in my Azure Blob Storage account. It is a monthly run that creates a CSV file under a folder structure like dir1/dir2/dir3/StartDateOfMonth-EndDateOfMonth.
I need to do the following:
1. Read this file in Java without downloading it.
2. Read it in parallel using the Spring Batch master-worker pattern.
Issues I am facing:
1. I am not getting the absolute path using the lines below:
CloudAppendBlob cloudAppendBlob= container.getAppendBlobReference("blob_file_name");
log.info("cloudAppendBlob.getUri().getPath() = {}",cloudAppendBlob.getUri().getPath());
2. Any help with how to do this using the Spring Batch master-worker pattern would be appreciated. (I know the normal master-worker pattern for reading a CSV file from a local path.)
1. Read this file in Java without downloading it.
You can use one of the file item readers (flat file, XML, JSON, etc.) provided by Spring Batch and configure it with an org.springframework.core.io.UrlResource. Here is a quick example:
UrlResource resource = new UrlResource("remote/url/to/your/file");
FlatFileItemReader<String> itemReader = new FlatFileItemReaderBuilder<String>()
.resource(resource)
// set other properties
.build();
2. Read it in parallel using the Spring Batch master-worker pattern.
You can use the remote partitioning technique provided by Spring Batch, where each file is processed in its own partition (i.e. one worker per file). Spring Batch provides the MultiResourcePartitioner, which was designed specifically for that. You can find more details in the Partitioning section of the reference documentation and a complete example here.
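For reference, here is a minimal, local-partitioning sketch of how MultiResourcePartitioner can be wired into a partitioned step. The class, bean names, and URLs are hypothetical, and a real remote-partitioning setup would additionally configure messaging between the manager and the workers:
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.UrlResource;

@Configuration
public class PartitionConfig {

    // One partition (ExecutionContext) is created per resource; the URLs are placeholders.
    @Bean
    public Partitioner partitioner() throws Exception {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(new Resource[] {
                new UrlResource("https://example.com/exports/2023-01.csv"),
                new UrlResource("https://example.com/exports/2023-02.csv")
        });
        return partitioner;
    }

    // The manager step fans out to the worker step, one execution per partition.
    @Bean
    public Step managerStep(StepBuilderFactory steps, Partitioner partitioner, Step workerStep) {
        return steps.get("managerStep")
                .partitioner("workerStep", partitioner)
                .step(workerStep)
                .gridSize(2)
                .build();
    }
}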
I have found a solution for downloading the .csv files from Azure Blob Storage in Java with the folder structure 'dir1/dir2/dir3/StartDateOfMonth-EndDateOfMonth':
@Override
public List listBlobs(String containerName) {
    List uris = new ArrayList<>();
    String fileName = null;
    try {
        CloudBlobContainer container = cloudBlobClient.getContainerReference(containerName);
        // "$Directory" is the name you provided during the creation of the export in your
        // Azure Storage account (please see the screenshot I have given below).
        Iterable<ListBlobItem> blobs = container.listBlobs("$Directory", true);
        BlobServiceClient blobServiceClient = new BlobServiceClientBuilder()
                .connectionString(environment.getProperty("azure.storage.ConnectionString"))
                .buildClient();
        BlobContainerClient containerClient = blobServiceClient.getBlobContainerClient(containerName);
        FileOutputStream fout = null;
        for (ListBlobItem fileBlob : blobs) {
            log.info("fileBlob instanceof CloudBlob = {}", fileBlob instanceof CloudBlob);
            if (fileBlob instanceof CloudBlob) {
                CloudBlob cloudBlob = (CloudBlob) fileBlob;
                uris.add(cloudBlob.getName());
                log.info("File Name is = {}", cloudBlob.getName());
                BlobClient blobClient = containerClient.getBlobClient(cloudBlob.getName());
                System.out.println(blobClient.getBlobUrl());
                System.out.println(blobClient.getBlobUrl().trim());
                if (blobClient.exists()) {
                    Path p = Paths.get(cloudBlob.getName());
                    String file = p.getFileName().toString();
                    String directory = p.getParent().toString();
                    log.info("Downloading Blob File = {} from Directory {}", file, directory);
                    File dir = new File("$LOCAL_PATH" + directory);
                    dir.mkdirs();
                    fout = new FileOutputStream("$LOCAL_PATH" + cloudBlob.getName());
                    blobClient.download(fout);
                    CloudAppendBlob cloudAppendBlob = container.getAppendBlobReference(cloudBlob.getName());
                    uris.add(cloudAppendBlob.getUri().toURL());
                    log.info("cloudAppendBlob.getUri().getPath() = {}", cloudAppendBlob.getUri().toURL());
                }
            }
        }
        for (ListBlobItem blobItem : container.listBlobs()) {
            uris.add(blobItem.getUri().toURL());
            //System.out.println("blobItem.getUri().getPath()= " + blobItem.getUri().getPath());
        }
    } catch (StorageException e) {
        e.printStackTrace();
    } catch (URISyntaxException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return uris;
}
This code downloads all the files from all sub-directories. To download only from the specific month's directory, add a check that the directory name matches the date range you want.
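For example, here is a hypothetical filter on the directory variable from the snippet above; the assumed folder naming ("20230101-20230131") is a placeholder, so adjust the formatting to however your export actually names the StartDateOfMonth-EndDateOfMonth folder:
// Build the expected "StartDateOfMonth-EndDateOfMonth" suffix for the current month.
java.time.LocalDate now = java.time.LocalDate.now();
String start = now.withDayOfMonth(1).format(java.time.format.DateTimeFormatter.BASIC_ISO_DATE);
String end = now.withDayOfMonth(now.lengthOfMonth()).format(java.time.format.DateTimeFormatter.BASIC_ISO_DATE);
String expectedDir = start + "-" + end;

if (directory.endsWith(expectedDir)) {
    // download this blob
} else {
    log.info("Skipping {} because it is not in this month's export directory", file);
}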
I am trying to read a file with a specific name that exists inside multiple .gz files within a folder. For example:
D:/sample_datasets/gzfiles
|-my_file_1.tar.gz
|-my_file_1.tar
|-file1.csv
|-file2.csv
|-file3.csv
|-my_file_2.tar.gz
|-my_file_2.tar
|-file1.csv
|-file2.csv
|-file3.csv
I am only interested in reading the contents of file1.csv, which has the same schema across all the .gz files. I am passing the path D:/sample_datasets/gzfiles to the wholeTextFiles() method of JavaSparkContext. However, it returns the contents of all the files within the tar, i.e. file1.csv, file2.csv, and file3.csv. Is there a way I can read only the contents of file1.csv into a Dataset or an RDD? Thanks in advance!
Use *.gz at the end of the path.
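For instance, a minimal illustration of the glob, assuming the JavaSparkContext from the question (note this still reads whole archives, not individual entries inside the tar):
// Glob only the .gz archives under the folder instead of passing the bare directory.
JavaPairRDD<String, String> data =
        sparkContext.wholeTextFiles("D:/sample_datasets/gzfiles/*.gz");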
Hope this helps!
I was able to do this using the following snippet, put together from multiple answers on SO:
JavaPairRDD<String, PortableDataStream> tarData =
        sparkContext.binaryFiles("D:/sample_datasets/gzfiles/*.tar.gz");

JavaRDD<Row> tarRecords = tarData.flatMap(new FlatMapFunction<Tuple2<String, PortableDataStream>, Row>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Iterator<Row> call(Tuple2<String, PortableDataStream> t) throws Exception {
        TsvParserSettings settings = new TsvParserSettings();
        TsvParser parser = new TsvParser(settings);
        List<Row> records = new ArrayList<>();
        TarArchiveInputStream tarInput =
                new TarArchiveInputStream(new GzipCompressorInputStream(t._2.open()));
        TarArchiveEntry entry;
        while ((entry = tarInput.getNextTarEntry()) != null) {
            if (entry.getName().equals("file1.csv")) {
                InputStreamReader streamReader = new InputStreamReader(tarInput);
                BufferedReader reader = new BufferedReader(streamReader);
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parsedLine = parser.parseLine(line);
                    Row row = RowFactory.create(parsedLine);
                    records.add(row);
                }
                reader.close();
                break;
            }
        }
        tarInput.close();
        return records.iterator();
    }
});
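If a Dataset is needed rather than a plain RDD, the resulting JavaRDD<Row> can be combined with a schema, for example as below. The column names and types are placeholders for whatever file1.csv actually contains, sparkSession is assumed to be an existing SparkSession, and StructType/DataTypes come from org.apache.spark.sql.types:
// Hypothetical schema for file1.csv; replace the column names/types with the real ones.
StructType schema = new StructType()
        .add("col1", DataTypes.StringType)
        .add("col2", DataTypes.StringType);

Dataset<Row> ds = sparkSession.createDataFrame(tarRecords, schema);
ds.show();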
I have the following situation: I am able to zip my files with the following method.
public boolean generateZip() {
    byte[] application = new byte[100000];
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // These are the files to include in the ZIP file
    String[] filenames = new String[]{"/subdirectory/index.html", "/subdirectory/webindex.html"};
    // Create a buffer for reading the files
    try {
        // Create the ZIP file
        ZipOutputStream out = new ZipOutputStream(baos);
        // Compress the files
        for (int i = 0; i < filenames.length; i++) {
            byte[] filedata = VirtualFile.fromRelativePath(filenames[i]).content();
            ByteArrayInputStream in = new ByteArrayInputStream(filedata);
            // Add ZIP entry to output stream.
            out.putNextEntry(new ZipEntry(filenames[i]));
            // Transfer bytes from the file to the ZIP file
            int len;
            while ((len = in.read(application)) > 0) {
                out.write(application, 0, len);
            }
            // Complete the entry
            out.closeEntry();
            in.close();
        }
        // Complete the ZIP file
        out.close();
    } catch (IOException e) {
        System.out.println("There was an error generating ZIP.");
        e.printStackTrace();
    }
    downloadzip(baos.toByteArray());
}
This works perfectly and I can download the xy.zip, which contains the following directory and file structure:
subdirectory/
----index.html
----webindex.html
My aim is to leave out the subdirectory completely so that the zip contains only the two files. Is there any way to achieve this?
(I am using Java on Google App Engine).
Thanks in advance
If you are sure the file names in the filenames array remain unique once you leave out the directory, change the line that constructs the ZipEntry:
String zipEntryName = new File(filenames[i]).getName();
out.putNextEntry(new ZipEntry(zipEntryName));
This uses java.io.File#getName()
You can use Apache Commons IO to list all your files and then read each one into an InputStream.
Replace the line below:
String[] filenames = new String[]{"/subdirectory/index.html", "/subdirectory/webindex.html"};
with the following:
Collection<File> files = FileUtils.listFiles(new File("/subdirectory"), new String[]{"html"}, true);
for (File file : files) {
    FileInputStream fileStream = new FileInputStream(file);
    byte[] filedata = IOUtils.toByteArray(fileStream);
    // From here you can proceed with your zipping.
}
Let me know if you have issues.
You could use the isDirectory() method on VirtualFile.
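A rough sketch of that idea, assuming Play 1.x's VirtualFile API (in particular that list() returns the directory's children; double-check against the version you are on):
// Walk the directory and add only regular files to the zip, using getName()
// so each entry lands at the root of the archive rather than under the subdirectory.
VirtualFile dir = VirtualFile.fromRelativePath("/subdirectory");
for (VirtualFile vf : dir.list()) {
    if (!vf.isDirectory()) {
        out.putNextEntry(new ZipEntry(vf.getName()));
        out.write(vf.content());
        out.closeEntry();
    }
}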
Use Case
I need to package up our KML, which is in a String, into a KMZ response for a network link in Google Earth. I would like to also wrap up icons and such while I'm at it.
Problem
Using the implementation below, I receive errors from WinZip and Google Earth that the archive is corrupted or that the file cannot be opened, respectively. The part that deviates from the other examples I built this from is the lines where the string is added:
ZipEntry kmlZipEntry = new ZipEntry("doc.kml");
out.putNextEntry(kmlZipEntry);
out.write(kml.getBytes("UTF-8"));
Please point me in the right direction to correctly write the string so that it ends up in doc.kml in the resulting KMZ file. I know how to write the string to a temporary file, but I would very much like to keep the operation in memory for understandability and efficiency.
private static final int BUFFER = 2048;

private static void kmz(OutputStream os, String kml)
{
    try {
        BufferedInputStream origin = null;
        ZipOutputStream out = new ZipOutputStream(os);
        out.setMethod(ZipOutputStream.DEFLATED);
        byte data[] = new byte[BUFFER];

        File f = new File("./icons"); // folder containing icons and such
        String files[] = f.list();
        if (files != null)
        {
            for (String file : files) {
                LOGGER.info("Adding to KMZ: " + file);
                FileInputStream fi = new FileInputStream(file);
                origin = new BufferedInputStream(fi, BUFFER);
                ZipEntry entry = new ZipEntry(file);
                out.putNextEntry(entry);
                int count;
                while ((count = origin.read(data, 0, BUFFER)) != -1) {
                    out.write(data, 0, count);
                }
                origin.close();
            }
        }
        ZipEntry kmlZipEntry = new ZipEntry("doc.kml");
        out.putNextEntry(kmlZipEntry);
        out.write(kml.getBytes("UTF-8"));
    }
    catch (Exception e)
    {
        LOGGER.error("Problem creating kmz file", e);
    }
}
Bonus points for showing me how to put the supplementary files from the icons folder into a similar folder within the archive, as opposed to at the same level as doc.kml.
Update: Even when saving the string to a temp file, the errors occur. Ugh.
Use case note: This is for a web app, and the code to get the list of files won't work there. For details see how-to-access-local-files-on-server-in-jboss-application.
You forgot to call close() on the ZipOutputStream. The best place to call it is in the finally block of the try block where it was created.
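Applied to the kmz method above, a minimal sketch of that structure might look like this (error handling trimmed for brevity):
ZipOutputStream out = new ZipOutputStream(os);
try {
    // ... add the icon entries and the doc.kml entry as before ...
    ZipEntry kmlZipEntry = new ZipEntry("doc.kml");
    out.putNextEntry(kmlZipEntry);
    out.write(kml.getBytes("UTF-8"));
} finally {
    // Without close() the zip's central directory is never written,
    // which is why WinZip and Google Earth report a corrupt archive.
    out.close();
}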
Update: To create a folder, just prepend its name to the entry name.
ZipEntry entry = new ZipEntry("icons/" + file);