Read specific file from multiple .gz file in Spark - java

I am trying to read a file with a specific name that exists in multiple .gz files within a folder. For example:
D:/sample_datasets/gzfiles
|-my_file_1.tar.gz
  |-my_file_1.tar
    |-file1.csv
    |-file2.csv
    |-file3.csv
|-my_file_2.tar.gz
  |-my_file_2.tar
    |-file1.csv
    |-file2.csv
    |-file3.csv
I am only interested in reading the contents of file1.csv, which has the same schema across all the .gz files. I am passing the path D:/sample_datasets/gzfiles to the wholeTextFiles() method in JavaSparkContext. However, it returns the contents of all the files within the tar, viz. file1.csv, file2.csv, file3.csv. Is there a way I can read only the contents of file1.csv into a Dataset or an RDD? Thanks in advance!

use *.gz at the end of the path.
Hope this helps!

I got this working with the following snippet, pieced together from several answers on SO:
JavaPairRDD<String, PortableDataStream> tarData = sparkContext.binaryFiles("D:/sample_datasets/gzfiles/*.tar.gz");
JavaRDD<Row> tarRecords = tarData.flatMap(new FlatMapFunction<Tuple2<String, PortableDataStream>, Row>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Iterator<Row> call(Tuple2<String, PortableDataStream> t) throws Exception {
        TsvParserSettings settings = new TsvParserSettings();
        TsvParser parser = new TsvParser(settings);
        List<Row> records = new ArrayList<>();
        // Each RDD element is one .tar.gz file: gunzip it, then walk the tar entries.
        TarArchiveInputStream tarInput = new TarArchiveInputStream(new GzipCompressorInputStream(t._2.open()));
        TarArchiveEntry entry;
        while ((entry = tarInput.getNextTarEntry()) != null) {
            // Only file1.csv is of interest; every other entry is skipped.
            if (entry.getName().equals("file1.csv")) {
                InputStreamReader streamReader = new InputStreamReader(tarInput);
                BufferedReader reader = new BufferedReader(streamReader);
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parsedLine = parser.parseLine(line);
                    Row row = RowFactory.create(parsedLine);
                    records.add(row);
                }
                reader.close();
                break;
            }
        }
        tarInput.close();
        return records.iterator();
    }
});

Related

Java program ignoring all the files inside the zip file [duplicate]

This question already has answers here:
How to unzip files recursively in Java?
(10 answers)
Closed last month.
I have a program that takes a zip folder path via the console. It goes through each item inside that folder (every child item, children of children, and so on). But if it encounters a zip folder, it ignores everything inside it. I need to read everything, including files inside nested zip folders.
Here is the method that goes through each item:
public static String[] getLogBuffers(String path) throws IOException//path is given via console
{
String zipFileName = path;
String destDirectory = path;
BufferedInputStream errorLogBuffer = null;
BufferedInputStream windowLogBuffer = null;
String strErrorLogFileContents="";
String strWindowLogFileContents="";
String[] errorString=new String[2];
byte[] buffer = new byte[1024];
ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFileName));
ZipEntry zipEntry = zis.getNextEntry();
while (zipEntry != null)
{
String filePath = destDirectory + "/" + zipEntry.getName();
System.out.println("unzipping" + filePath);
if (!zipEntry.isDirectory())
{
if (zipEntry.getName().endsWith("errorlog.txt"))
{
ZipFile zipFile = new ZipFile(path);
InputStream errorStream = zipFile.getInputStream(zipEntry);
BufferedInputStream bufferedInputStream=new BufferedInputStream(errorStream);
byte[] contents = new byte[1024];
System.out.println("ERRORLOG NAME"+zipEntry.getName());
int bytesRead = 0;
while((bytesRead = bufferedInputStream.read(contents)) != -1) {
strErrorLogFileContents += new String(contents, 0, bytesRead);
}
}
if (zipEntry.getName().endsWith("windowlog.txt"))
{ ZipFile zipFile = new ZipFile(path);
InputStream windowStream = zipFile.getInputStream(zipEntry);
BufferedInputStream bufferedInputStream=new BufferedInputStream(windowStream);
byte[] contents = new byte[1024];
System.out.println("WINDOWLOG NAME"+zipEntry.getName());
int bytesRead = 0;
while((bytesRead = bufferedInputStream.read(contents)) != -1) {
strWindowLogFileContents += new String(contents, 0, bytesRead);
}
}
}
zis.closeEntry();
zipEntry = zis.getNextEntry();
}
errorString[0]=strErrorLogFileContents;
errorString[1]=strWindowLogFileContents;
zis.closeEntry();
zis.close();
System.out.println("Buffers ready");
return errorString;
}
Items accessed inside the parent zip folder (my console output):
unzippingC:logFolders/logX3.zip/logX3/
unzippingC:logFolders/logX3.zip/logX3/Anan/
unzippingC:logFolders/logX3.zip/logX3/Anan/errorreports/
unzippingC:logFolders/logX3.zip/logX3/Anan/errorreports/2021-11-23_103518.zip
unzippingC:logFolders/logX3.zip/logX3/Anan/errorreports/errorlog.txt
unzippingC:logX3.zip/logX3/Anan/errorreports/version.txt
unzippingC:logFolders/logX3.zip/logX3/Anan/errorreports/windowlog.txt
As you can see, the program only goes as far as 2021-11-23_103518.zip and then moves on to another path, but 2021-11-23_103518.zip has child items (files) that I need to access.
I'd appreciate any help, thanks.
A zip file is not a folder. Although Windows treats a zip file as if it’s a folder,* it is not a folder. A .zip file is a single file with an internal table of entries, each containing compressed data.
Each inner .zip file you read requires a new ZipFile or ZipInputStream. There is no way around that.
You should not create new ZipFile instances to read the same .zip file’s entries. You only need one ZipFile object. You can go through its entries with its entries() method, and you can read each entry with the ZipFile’s getInputStream method.
(I wouldn’t be surprised if using multiple objects to read the same zip file were to run into file locking problems on Windows.)
try (ZipFile zipFile = new ZipFile(path))
{
Enumeration<? extends ZipEntry> entries = zipFile.entries();
while (entries.hasMoreElements())
{
ZipEntry zipEntry = entries.nextElement();
if (zipEntry.getName().endsWith("errorlog.txt"))
{
try (InputStream errorStream = zipFile.getInputStream(zipEntry))
{
// ...
}
}
}
}
Notice that no other ZipFile or ZipInputStream objects are created. Only zipFile reads and traverses the file. Also notice the use of a try-with-resources statement to implicitly close the ZipFile and the InputStream.
You should not use += to build a String. Doing so creates a lot of intermediate String objects which will have to be garbage collected, which can hurt your program’s performance. You should wrap each zip entry’s InputStream in an InputStreamReader, then use that Reader’s transferTo method to append to a single StringWriter that holds your combined log.
StringWriter strErrorLogFileContents = new StringWriter();
StringWriter strWindowLogFileContents = new StringWriter();
try (ZipFile zipFile = new ZipFile(path))
{
Enumeration<? extends ZipEntry> entries = zipFile.entries();
while (entries.hasMoreElements())
{
ZipEntry zipEntry = entries.nextElement();
if (zipEntry.getName().endsWith("errorlog.txt"))
{
try (Reader entryReader = new InputStreamReader(
zipFile.getInputStream(zipEntry),
StandardCharsets.UTF_8))
{
entryReader.transferTo(strErrorLogFileContents);
}
}
}
}
Notice the use of StandardCharsets.UTF_8. It is almost never correct to create a String from bytes without specifying the Charset. If you don’t provide the Charset, Java will use the system’s default Charset, which means your program will behave differently in Windows than it will on other operating systems.
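To make the charset point concrete, here is a minimal sketch (the class and method names are made up for illustration) that decodes the same two bytes with two different charsets. It only shows why the charset should be named explicitly; it is not part of the log-reading code above.

```java
import java.nio.charset.StandardCharsets;

class CharsetDemo {

    /** Decodes the bytes as UTF-8, regardless of the platform default. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as ISO-8859-1, one char per byte. */
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is encoded as two bytes in UTF-8: 0xC3 0xA9.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(bytes).length());   // 1
        System.out.println(decodeLatin1(bytes).length()); // 2
    }
}
```

If the charset were left out, which of the two results you get would depend on the machine's default charset.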
If you are stuck with Java 8, you won’t have the transferTo method of Reader, so you will have to do the work yourself:
if (zipEntry.getName().endsWith("errorlog.txt"))
{
try (Reader entryReader = new BufferedReader(
new InputStreamReader(
zipFile.getInputStream(zipEntry),
StandardCharsets.UTF_8)))
{
int c;
while ((c = entryReader.read()) >= 0)
{
strErrorLogFileContents.write(c);
}
}
}
The use of BufferedReader means you don’t need to create your own array and implement bulk reads yourself. BufferedReader already does that for you.
As mentioned above, a zip entry which is itself an inner zip file requires a brand new ZipFile or ZipInputStream object to read it. I recommend copying the entry to a temporary file, since reading from a ZipInputStream made from another ZipInputStream is known to be slow, then deleting the temporary file after you’re done reading it.
try (ZipFile zipFile = new ZipFile(path))
{
Enumeration<? extends ZipEntry> entries = zipFile.entries();
while (entries.hasMoreElements())
{
ZipEntry zipEntry = entries.nextElement();
if (zipEntry.getName().endsWith(".zip"))
{
Path tempZipFile = Files.createTempFile(null, ".zip");
try (InputStream errorStream = zipFile.getInputStream(zipEntry))
{
Files.copy(errorStream, tempZipFile,
StandardCopyOption.REPLACE_EXISTING);
}
String[] logsFromZip = getLogBuffers(tempZipFile.toString());
strErrorLogFileContents.write(logsFromZip[0]);
strWindowLogFileContents.write(logsFromZip[1]);
Files.delete(tempZipFile);
}
}
}
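For comparison, here is a rough sketch of the other option mentioned above: wrapping a nested entry in a second ZipInputStream instead of copying it to a temporary file. As noted, this is expected to be slower, so treat it as an illustration for small archives rather than a recommendation. The class NestedZipLister, its methods, the "!/" separator in the printed names, and the in-memory sample archive are all assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class NestedZipLister {

    /** Adds the name of every file entry to names, descending into inner .zip entries. */
    static void collect(InputStream in, String prefix, List<String> names) throws IOException {
        // Deliberately not closed here: closing this wrapper would also close the outer stream.
        ZipInputStream zis = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String name = prefix + entry.getName();
            if (entry.getName().endsWith(".zip")) {
                // The current entry's data is itself a zip: read it with a new ZipInputStream.
                collect(zis, name + "!/", names);
            } else {
                names.add(name);
            }
        }
    }

    /** Demo helper: builds a small zip archive in memory. */
    static byte[] zipOf(String[] entryNames, byte[][] contents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (int i = 0; i < entryNames.length; i++) {
                zos.putNextEntry(new ZipEntry(entryNames[i]));
                zos.write(contents[i]);
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Builds an outer zip containing an inner zip and lists every file in both. */
    static List<String> demo() {
        try {
            byte[] text = "boom".getBytes(StandardCharsets.UTF_8);
            byte[] inner = zipOf(new String[] {"errorlog.txt"}, new byte[][] {text});
            byte[] outer = zipOf(
                    new String[] {"logX3/2021-11-23_103518.zip", "logX3/windowlog.txt"},
                    new byte[][] {inner, text});
            List<String> names = new ArrayList<>();
            collect(new ByteArrayInputStream(outer), "", names);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : demo()) {
            System.out.println(name);
        }
    }
}
```

The caller that opened the outermost stream is the one responsible for closing it; the nested wrappers are left open on purpose.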
Finally, consider creating a meaningful class for your return value. An array of Strings is difficult to understand. A caller won’t know that it always contains exactly two elements and won’t know what those two elements are. A custom return type would be pretty short:
public class Logs {
private final String errorLog;
private final String windowLog;
public Logs(String errorLog,
String windowLog)
{
this.errorLog = errorLog;
this.windowLog = windowLog;
}
public String getErrorLog()
{
return errorLog;
}
public String getWindowLog()
{
return windowLog;
}
}
As of Java 16, you can use a record to make the declaration much shorter:
public record Logs(String errorLog,
String windowLog)
{ }
Whether you use a record or write out the class, you can use it as a return type in your method:
public static Logs getLogBuffers(String path) throws IOException
{
// ...
return new Logs(
strErrorLogFileContents.toString(),
strWindowLogFileContents.toString());
}
* The Windows explorer shell’s practice of treating zip files as folders is a pretty bad user interface. I know I’m not the only one who thinks so. It often ends up making things more difficult for users instead of easier.

Need to Merge Avro Files using java application

I am having multiple avro files under a directory which reside on hadoop environment, I need to merge all these files and make it as a single avro file.
example
/abc->
x.avro
y.avro } => a.avro
z.avro
The file a.avro will contain contents of all x,y,z files, where x,y,z files having same schema. I need to create a java application. Any help appreciated.
Thanks.
Apache Avro provides several tools for working with Avro files. These include the concat tool, which merges Avro files that share the same schema and non-reserved metadata; the cat tool, which extracts samples from an Avro data file; a conversion tool, which converts an input file from Avro binary into JSON; and the recovery tool, which recovers data from a corrupt Avro data file (find more on the GitHub URL mentioned).
I have extracted the code from the tools mentioned on GitHub; here is a Java application that should serve your purpose.
try {
Path inPath = new Path("C:\\Users\\vaijnathp\\IdeaProjects\\MSExcel\\vaj");
Path outPath = new Path("getDestinationPath") ;
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] contents = fs.listStatus(inPath, new OutputLogFilter());
DataFileWriter<GenericRecord> writer = new DataFileWriter<>(new GenericDatumWriter<>());
Schema schema = null;
String inputCodec = null;
Map<String, byte[]> metadata = new TreeMap<>();
BufferedOutputStream output = new BufferedOutputStream(fs.create(outPath));
for (int i = 0; i < contents.length; i++) {
FileStatus folderContent = contents[i];
if (folderContent.isFile() && folderContent.getPath().getName().endsWith(".avro")) {
InputStream input = new BufferedInputStream(fs.open(folderContent.getPath()));
DataFileStream<GenericRecord> reader = new DataFileStream<>(input, new GenericDatumReader<GenericRecord>());
if (schema == null) {
schema = reader.getSchema();
//extract metadata for further check.
extractAvroFileMetadata(writer, metadata, reader);
inputCodec = reader.getMetaString(DataFileConstants.CODEC);
if (inputCodec == null) inputCodec = DataFileConstants.NULL_CODEC;
writer.setCodec(CodecFactory.fromString(inputCodec));
writer.create(schema, output);
} else {
if (!schema.equals(reader.getSchema())) reader.close();
//compare FileMetadata with previously extracted one
CompareAvroFileMetadata(metadata, reader, folderContent.getPath().getName());
String thisCodec = reader.getMetaString(DataFileConstants.CODEC);
if (thisCodec == null) thisCodec = DataFileConstants.NULL_CODEC;
if (!inputCodec.equals(thisCodec)) reader.close();
}
writer.appendAllFrom(reader, false);
reader.close();
}
}
writer.close();
}catch (Exception e){
e.printStackTrace();
}
I hope this code snippet will help you create your java application. Thanks.

Data reading in a jar File (in Java) and counting files

So here is my problem (I read the other answers, but didn't quite get it).
In a group of 4, we have created a game in Java as a university project. Part of this is creating a *.jar file via Ant. There are several game boards saved in GameBoardx.txt files, where x is the number. We want to randomly select one of them. Therefore, every time a game board is loaded, the files in the GameBoard directory are counted in order to generate a random number in the correct range. Our code works perfectly fine when run from Eclipse. When run from the *.jar file, it fails and exits with a NullPointerException.
int number = 0;
int fileCount = new File(new File("").getAbsolutePath()+"/GameBoards/").listFiles().length;
Random rand = new Random();
number = rand.nextInt(fileCount);
These Files are read later on using this:
static String fileName = new File("").getAbsolutePath();
static String line = null;
boolean verticalObstacles[][] = new boolean[16][17];
int currentLine = 1;
try {
FileReader fileReader = new FileReader(fileName+"/GameBoards/Board"+boardNumber+".txt");
BufferedReader bufferedReader = new BufferedReader(fileReader);
while ((line = bufferedReader.readLine()) != null){
if (currentLine <17){
for (int i=0; i<17; i++){
if (line.charAt(i) == '1'){
verticalObstacles[currentLine-1][i] = true;
} else {
verticalObstacles[currentLine-1][i] = false;
}
}
}
currentLine ++;
}
bufferedReader.close();
The rest of the code works with the *.jar File and the *.txt Files are included in it.
The solutions I found were not good for us, because the code has to work with the *.jar File as well as just starting it from Eclipse to pass the test.
What's the solution here to make it work in both?
The problem here is that you cannot read the contents of a jar using File; use the java.nio classes for this instead.
First, you can count the files in either a jar or a normal folder by using the FileSystem, Path and FileVisitor classes.
The following code works both from the jar and from the IDE:
ClassLoader sysClassLoader = ClassLoader.getSystemClassLoader();
URI uri = sysClassLoader.getResource("GameBoards").toURI();
Path gameBoardPath = null;
if (uri.getScheme().equals("jar")) {
FileSystem fileSystem = FileSystems.newFileSystem(uri,
Collections.<String, Object> emptyMap());
gameBoardPath = fileSystem.getPath("/GameBoards");
} else {
gameBoardPath = Paths.get(uri);
}
PathVisitor pathVisitor = new PathVisitor();
Files.walkFileTree(gameBoardPath, pathVisitor);
System.out.println(pathVisitor.getFileCount());
Following is the code for PathVisitor class
class PathVisitor extends SimpleFileVisitor<Path> {
private int fileCount = 0;
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
throws IOException {
fileCount++;
return FileVisitResult.CONTINUE;
}
public int getFileCount() {
return fileCount;
}
}
Then read the content of a specific file by using ClassLoader#getResourceAsStream:
// ADD your random file picking logic here based on file Count to get boardNum
int boardNum = 1;
InputStream is = sysClassLoader.getResourceAsStream("GameBoards/Board" + boardNum + ".txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
String line = null;
while((line=reader.readLine())!=null) {
System.out.println(line);
}
Hope this resolves your concerns and helps you in right direction.
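The random-pick step that is left as a comment in the snippet above could be sketched roughly as follows. This is only an illustration under some assumptions: the class BoardPicker and its methods are made-up names, the demo builds a throwaway jar in a temp file instead of using your real one, and it lists the directory with Files.newDirectoryStream rather than the PathVisitor shown earlier. The listing and picking code works on any Path, whether it comes from the default file system or from a jar file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class BoardPicker {

    /** Lists the *.txt files directly inside dir, sorted by path. */
    static List<Path> listBoards(Path dir) throws IOException {
        List<Path> boards = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    boards.add(p);
                }
            }
        }
        Collections.sort(boards);
        return boards;
    }

    /** Picks one board at random; boards must not be empty. */
    static Path pickRandom(List<Path> boards, Random random) {
        return boards.get(random.nextInt(boards.size()));
    }

    /** Builds a temporary jar with three boards, lists them and reads one picked at random. */
    static List<String> demo() {
        try {
            Path jar = Files.createTempFile("boards", ".jar");
            try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(jar))) {
                for (int i = 1; i <= 3; i++) {
                    zos.putNextEntry(new ZipEntry("GameBoards/Board" + i + ".txt"));
                    zos.write("1010".getBytes(StandardCharsets.UTF_8));
                    zos.closeEntry();
                }
            }
            List<String> result = new ArrayList<>();
            URI uri = URI.create("jar:" + jar.toUri());
            try (FileSystem fs = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap())) {
                List<Path> boards = listBoards(fs.getPath("/GameBoards"));
                for (Path p : boards) {
                    result.add(p.getFileName().toString());
                }
                Path chosen = pickRandom(boards, new Random());
                result.add(Files.readAllLines(chosen, StandardCharsets.UTF_8).get(0));
            } finally {
                Files.delete(jar);
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the real program you would pass the gameBoardPath obtained above to listBoards instead of the demo jar.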

Android: Open file with specific path [duplicate]

I have a filename in my code as :
String NAME_OF_FILE="//sdcard//imageq.png";
FileInputStream fis =this.openFileInput(NAME_OF_FILE); // 2nd line
I get an error on 2nd line :
05-11 16:49:06.355: ERROR/AndroidRuntime(4570): Caused by: java.lang.IllegalArgumentException: File //sdcard//imageq.png contains a path separator
I tried this format also:
String NAME_OF_FILE="/sdcard/imageq.png";
The solution is:
FileInputStream fis = new FileInputStream (new File(NAME_OF_FILE)); // 2nd line
The openFileInput method doesn't accept path separators.
Don't forget to
fis.close();
at the end.
This method opens a file in the private data area of the application. You cannot open any files in subdirectories in this area or from entirely other areas using this method. So use the constructor of the FileInputStream directly to pass the path with a directory in it.
openFileInput() doesn't accept paths, only a file name.
If you want to access a path, use File file = new File(path) and a corresponding FileInputStream.
I got the above error message while trying to access a file from Internal Storage using openFileInput("/Dir/data.txt") method with subdirectory Dir.
You cannot access sub-directories using the above method.
Try something like:
FileInputStream fIS = new FileInputStream (new File("/Dir/data.txt"));
You cannot use a path with directory separators directly; you have to make a File object for every directory.
NOTE: This code creates the directories; yours may not need that.
File file= context.getFilesDir();
file.mkdir();
String[] array=filePath.split("/");
for(int t=0; t< array.length -1 ;t++)
{
file=new File(file,array[t]);
file.mkdir();
}
File f=new File(file,array[array.length-1]);
RandomAccessFileOutputStream rvalue = new RandomAccessFileOutputStream(f,append);
String all = "";
try {
BufferedReader br = new BufferedReader(new FileReader(filePath));
String strLine;
while ((strLine = br.readLine()) != null){
all = all + strLine;
}
} catch (IOException e) {
Log.e("notes_err", e.getLocalizedMessage());
}
I solved this type of error by making a directory in the onCreate event, then accessing the directory by creating a new file object in a method that needs to do something such as save or retrieve a file in that directory, hope this helps!
public class MyClass extends Activity {
    private String state;
    public File myFilename;

    @Override
    protected void onCreate(Bundle savedInstanceState) { // create your directory the user will be able to find
        super.onCreate(savedInstanceState);
        // Check that external storage is mounted before touching it.
        state = Environment.getExternalStorageState();
        if (Environment.MEDIA_MOUNTED.equals(state)) {
            myFilename = new File(Environment.getExternalStorageDirectory().toString() + "/My Directory");
            if (!myFilename.exists()) {
                myFilename.mkdirs();
            }
        }
    }

    public void myMethod() {
        File fileTo = new File(myFilename.toString() + "/myPic.png");
        // use fileTo object to save your file in your new directory that was created in the onCreate method
    }
}
I did it like this:
var dir = File(app.filesDir, directoryName)
if(!dir.exists()){
dir.mkdir()
}
var directory = app.getDir(directoryName, Context.MODE_PRIVATE)
val file = File(directory, fileName)
file.outputStream().use {
it.write(body.bytes())
}

Search a text file for List of names in JAVA

I have the following:
A folder that contains many files (about 300000), named "AllFilesFolder"
A list of names, named "namesList"
An empty folder, named "filteredFolder"
I want to filter the folder "AllFilesFolder" by moving any file that contains any of the names in the list to the empty folder "filteredFolder".
I have approached this problem with the following code:
public static void doIt(List<String>namesList, String AllFilesFolder, String filteredFolder) throws FileNotFoundException {
// here we put all the files in the original folder into an array named "filesList"
File[] filesList = new File(AllFilesFolder).listFiles();
// go through the files one by one
for (File f : filesList) {
try {
FileReader fr = new FileReader(f);
BufferedReader reader = new BufferedReader(fr);
String line = "";
// this variable is used to record whether the file contains any of the names
// we set it to false.
boolean goodDoc = false;
// go through the file line by line to check the names (I wonder if there is a simpler way)
while ((line = reader.readLine()) != null) {
for(String name:namesList){
if ( line.contains(name)) {
goodDoc = true;
}
}
}
reader.close();
// if this file contains the name we put this file into the other folder "filteredFolder"
if (goodDoc) {
InputStream inputStream = new FileInputStream(f);
OutputStream out = new FileOutputStream(new File(filteredFolder, f.getName()));
int read = 0;
byte[] bytes = new byte[4096];
while ((read = inputStream.read(bytes)) != -1) {
out.write(bytes, 0, read);
}
inputStream.close();
out.flush();
out.close();
}
} catch (Exception e) {
System.err.println(e);
}
}
}
By doing this I have two problems that I need your advice to solve:
I am reading each file twice, once to search it and once to put it into the other folder.
When searching namesList I have a for loop that takes the names one by one. Is there a way to search the list in one go (without a loop)?
Many thanks in advance
I am reading each file twice, once to search it and once to put it into the other folder.
Using NIO improves the copy performance. Here is the code example. If you can use Java 7 or later, you can use Files.copy().
When searching namesList I have a for loop that takes the names one by one. Is there a way to search the list in one go (without a loop)?
Use a HashSet to store the names and use its contains() method, which is an O(1) operation. Note that contains() tests for an exact element, so each line would first have to be split into words and each word looked up. Another suggestion is to use Scanner.findWithinHorizon(pattern, horizon).
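A rough sketch of how both points could be combined, under a few assumptions. It swaps in a different technique from the HashSet suggestion: all names are joined into one regular expression, so each line is tested once and substring matches (as in the original line.contains(name)) are preserved. It also uses Files.move instead of copying, since the goal was to move the files, and it stops reading a file at the first matching line. The class NameFilter, its method names, and the sample files in demo() are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NameFilter {

    /** Builds one pattern that matches if any of the names occurs as a substring. */
    static Pattern anyOf(Collection<String> names) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("names must not be empty");
        }
        return Pattern.compile(
                names.stream().map(Pattern::quote).collect(Collectors.joining("|")));
    }

    /** Moves every regular file in source that contains a match into target; returns the count. */
    static int moveMatching(Pattern pattern, Path source, Path target) throws IOException {
        // Snapshot the listing first so files are not moved while the directory is iterated.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(source)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        int moved = 0;
        for (Path file : files) {
            boolean match;
            // The reader is closed before the move; anyMatch stops at the first matching line.
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                match = reader.lines().anyMatch(line -> pattern.matcher(line).find());
            }
            if (match) {
                Files.move(file, target.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                moved++;
            }
        }
        return moved;
    }

    /** Demo on two temporary folders (left behind for the OS to clean up). */
    static List<String> demo() {
        try {
            Path source = Files.createTempDirectory("AllFilesFolder");
            Path target = Files.createTempDirectory("filteredFolder");
            Files.write(source.resolve("a.txt"), Arrays.asList("hello Alice", "bye"), StandardCharsets.UTF_8);
            Files.write(source.resolve("b.txt"), Arrays.asList("nothing here"), StandardCharsets.UTF_8);
            moveMatching(anyOf(Arrays.asList("Alice", "Bob")), source, target);
            List<String> result = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(target)) {
                for (Path p : stream) {
                    result.add(p.getFileName().toString());
                }
            }
            result.add("b.txt still in source: " + Files.exists(source.resolve("b.txt")));
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch assumes the files are UTF-8 text; a different encoding would need a different charset in newBufferedReader.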
