I have around 1000 files, each about 1 GB in size, and I need to find a string in all of these 1000 files, and also which files contain that particular string. I am working with the Hadoop File System, and all 1000 files are in HDFS.
All 1000 files are under the real folder, so if I run the command below, I get all 1000 files listed. I need to find which files under the real folder contain a particular string, hello.
bash-3.00$ hadoop fs -ls /technology/dps/real
And this is how my data is structured in HDFS:
row format delimited
fields terminated by '\29'
collection items terminated by ','
map keys terminated by ':'
stored as textfile
How can I write a MapReduce job for this particular problem, so that I can find which files contain a particular string? Any simple example would be a great help to me.
Update:
Using grep in Unix I can solve the above problem, but it is very, very slow and takes a long time to get the actual output:
hadoop fs -ls /technology/dps/real | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep cec7051a1380a47a4497a107fecb84c1 >/dev/null && echo $f; done
That is the reason I was looking for a MapReduce job to handle this kind of problem...
It sounds like you're looking for a grep-like program, which is easy to implement using Hadoop Streaming (the Hadoop Java API would work too):
First, write a mapper that outputs the name of the file being processed if the line being processed contains your search string. I used Python, but any language would work:
#!/usr/bin/env python
import os
import sys

SEARCH_STRING = os.environ["SEARCH_STRING"]

for line in sys.stdin:
    if SEARCH_STRING in line.split():
        print os.environ["map_input_file"]
This code reads the search string from the SEARCH_STRING environment variable. Here, I split the input line and check whether the search string matches any of the splits; you could change this to perform a substring search or use regular expressions to check for matches.
Next, run a Hadoop streaming job using this mapper and no reducers:
$ bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
-D mapred.reduce.tasks=0 \
-input hdfs:///data \
-mapper search.py \
-file search.py \
-output /search_results \
-cmdenv SEARCH_STRING="Apache"
The output will be written in several parts; to obtain a list of matches, you can simply cat the files (provided they aren't too big):
$ bin/hadoop fs -cat /search_results/part-*
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/ivy.xml
hdfs://localhost/data/README.txt
...
To get the filename you are currently processing, do:
((FileSplit) context.getInputSplit()).getPath().getName()
As you search your file record by record, emit the above path whenever you see hello (and perhaps the matching line or anything else useful).
Set the number of reducers to 0; they aren't doing anything here.
Does 'row format delimited' mean that lines are delimited by a newline? In that case TextInputFormat and LineRecordReader work fine here.
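For example, a minimal mapper sketch along these lines (untested; it assumes the new mapreduce API with TextInputFormat, and "hello" stands in for the search string):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SearchMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains("hello")) {
            // Emit the path of the file this split came from.
            Path path = ((FileSplit) context.getInputSplit()).getPath();
            context.write(new Text(path.toString()), NullWritable.get());
        }
    }
}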
You can try something like this, though I'm not sure it's an efficient way to go about it. Let me know if it works; I haven't tested it or anything.
You can use it like this: java SearchFiles /technology/dps/real hello, making sure you run it from the appropriate directory, of course.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

public class SearchFiles {

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: [search-dir] [search-string]");
            return;
        }
        File searchDir = new File(args[0]);
        String searchString = args[1];
        ArrayList<File> matches = checkFiles(searchDir.listFiles(), searchString, new ArrayList<File>());
        System.out.println("These files contain '" + searchString + "':");
        for (File file : matches) {
            System.out.println(file.getPath());
        }
    }

    private static ArrayList<File> checkFiles(File[] files, String search, ArrayList<File> acc) throws IOException {
        for (File file : files) {
            if (file.isDirectory()) {
                checkFiles(file.listFiles(), search, acc);
            } else {
                if (fileContainsString(file, search)) {
                    acc.add(file);
                }
            }
        }
        return acc;
    }

    private static boolean fileContainsString(File file, String search) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains(search)) {
                in.close();
                return true;
            }
        }
        in.close();
        return false;
    }
}
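Note that this walks the local filesystem, so it only helps once the data has been copied out of HDFS. A minimal sketch of the same idea against HDFS via the FileSystem API (untested; it assumes a recent Hadoop client with fs.defaultFS configured, and the arguments are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSearchFiles {
    public static void main(String[] args) throws IOException {
        // args[0] = HDFS directory (e.g. /technology/dps/real), args[1] = search string
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (status.isDirectory()) continue;
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains(args[1])) {
                        System.out.println(status.getPath());
                        break; // one hit is enough to report the file
                    }
                }
            }
        }
    }
}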
Related
I have a problem extracting an archive to the desired directory using Java 10 ProcessBuilder and 7z.exe (18.05) from the command line. The exact same command works as intended in Windows CMD, but no longer works when issued from my JavaFX application via ProcessBuilder:
public static void decompress7ZipEmbedded(File source, File destination) throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(
            getSevenZipExecutablePath(),
            EXTRACT_WITH_FULL_PATHS_COMMAND,
            quotifyPath(source.getAbsolutePath()),
            OUTPUT_DIRECTORY_SWITCH + quotifyPath(destination.getAbsolutePath())
    );
    processWithSevenZipEmbedded(pb);
}

private static void processWithSevenZipEmbedded(ProcessBuilder pb) throws IOException, InterruptedException {
    LOG.info("7-zip command issued: " + String.join(" ", pb.command()));
    Process p = pb.start();
    new Thread(new InputConsumer(p.getInputStream())).start();
    System.out.println("Exited with: " + p.waitFor());
}

public static class InputConsumer implements Runnable {
    private InputStream is;

    InputConsumer(InputStream is) {
        this.is = is;
    }

    @Override
    public void run() {
        try {
            int value = -1;
            while ((value = is.read()) != -1) {
                System.out.print((char) value);
            }
        } catch (IOException exp) {
            exp.printStackTrace();
        }
        LOG.debug("Output stream completed");
    }
}

public static String getSevenZipExecutablePath() {
    return FileUtil.quotifyPath(getDirectory() + "7z" + "/" + "7z");
}

public static String quotifyPath(String path) {
    return '"' + path + '"';
}

public class Commands {
    public static final String EXTRACT_COMMAND = "e";
    public static final String EXTRACT_WITH_FULL_PATHS_COMMAND = "x";
    public static final String PACK_COMMAND = "a";
    public static final String DELETE_COMMAND = "d";
    public static final String BENCHMARK_COMMAND = "b";
    public static final String LIST_COMMAND = "l";
}

public class Switches {
    public static final String OUTPUT_DIRECTORY_SWITCH = "-o";
    public static final String RECURSIVE_SWITCH = "-r";
    public static final String ASSUME_YES = "y";
}
The command looks like this:
"C:/Users/blood/java_projects/AppRack/target/classes/7z/7z" x "D:\Pulpit\AppRack Sandbox\test\something\Something 2\Something2.7z" -o"D:\Pulpit\AppRack Sandbox\Something2"
And the output from ProcessBuilder:
7-Zip 18.05 (x64) : Copyright (c) 1999-2018 Igor Pavlov : 2018-04-30
Scanning the drive for archives:
1 file, 59177077 bytes (57 MiB)
Extracting archive: D:\Pulpit\AppRack Sandbox\test\Something\Something 2\Something2.7z
--
Path = D:\Pulpit\AppRack Sandbox\test\Something\Something 2\Something2.7z
Type = 7z
Physical Size = 5917Exited with: 0
7077
Headers Size = 373
Method = LZMA2:26 LZMA:20 BCJ2
Solid = +
Blocks = 2
No files to process
Everything is Ok
Files: 0
Size: 0
Compressed: 59177077
It doesn't do anything: it doesn't create the desired folder, nothing. From CMD it works like a charm (here is the log from Windows 10 CMD using the same command):
7-Zip 18.05 (x64) : Copyright (c) 1999-2018 Igor Pavlov : 2018-04-30
Scanning the drive for archives:
1 file, 59177077 bytes (57 MiB)
Extracting archive: D:\Pulpit\AppRack Sandbox\test\Something\Something 2\Something2.7z
--
Path = D:\Pulpit\AppRack Sandbox\test\Something\Something 2\Something2.7z
Type = 7z
Physical Size = 59177077
Headers Size = 373
Method = LZMA2:26 LZMA:20 BCJ2
Solid = +
Blocks = 2
Everything is Ok
Folders: 1
Files: 5
Size: 64838062
Compressed: 59177077
Do you have any idea what causes the difference here, and why it says "No files to process, Everything is Ok" without doing anything? I've tried creating the folder first using the File class, but that doesn't seem to be the issue, because the results are the same whether the destination folder exists before extracting or not.
I've tried everything that has come to my mind and I've run out of ideas at the moment. Please share any suggestions you may have regarding this issue. Thanks a lot.
Thank you very much for your help.
Don’t quote your arguments. Quotes are for the command shell’s benefit. ProcessBuilder is not a command shell; it executes a command directly, so any quotes are seen as part of the argument itself (that is, the file name). Also, pb.inheritIO(); is a better way to see the output of the child process than manually consuming process streams.
Thank you @VGR, that was indeed the issue: after I removed the method that quotes paths in the command above, it works like a charm and extracts the archive without any problem! So the conclusion is that I shouldn't have used quotes in paths when using Java's ProcessBuilder.
I've also used pb.inheritIO(), and you are right, it is much better and easier to manage it this way.
public static void decompress7ZipEmbedded(File source, File destination) throws IOException {
    ProcessBuilder pb = new ProcessBuilder().inheritIO().command(
            getSevenZipExecutablePath(),
            EXTRACT_WITH_FULL_PATHS_COMMAND,
            source.getAbsolutePath(),
            OUTPUT_DIRECTORY_SWITCH + destination.getAbsolutePath(),
            OVERWRITE_WITHOUT_PROMPT
    );
    processWithSevenZipEmbedded(pb);
}

private static void processWithSevenZipEmbedded(ProcessBuilder pb) throws IOException {
    LOG.info("7-zip command issued: " + String.join(" ", pb.command()));
    pb.start();
}

public class Commands {
    public static final String EXTRACT_WITH_FULL_PATHS_COMMAND = "x";
}

public class Switches {
    public static final String OUTPUT_DIRECTORY_SWITCH = "-o";
    public static final String OVERWRITE_WITHOUT_PROMPT = "-aoa";
}
Double-click the file 7zip.chm, or start 7-Zip and open the Help, and read the help page Command Line Version - Syntax, whose first line is 7z <command> [<switches>...] <archive_name> [<file_names>...]. It clearly explains that the command x must be specified first, then the switches such as -o (ideally ending with the last switch --), then the archive file name, and last further arguments such as the names of the files/folders to extract. Switches can also be specified after the archive file name, but that is not recommended, although the examples on the help page for -o do put -o at the end.
Thank you @Mofi for the tip. I used the -aoa switch instead of -y and it finally started to work as I wanted, overwriting files without any prompt. I left the rest of the command the way it was, as it works as intended, so it finally looks like this:
C:/Users/blood/java_projects/AppRack/target/classes/7z/7z x D:\Pulpit\AppRack Sandbox\test\Test\Test 2\Test.7z -oD:\Desktop\AppRack Sandbox\Test 2 -aoa
Thanks a lot for help once again!
I am printing the responses to a CSV file using a Beanshell sampler, but the test does not stop after it finishes writing.
What can be done so that it stops after printing? Below is the sample code I have used; acctId is set in a pre-processor in another thread group.
import java.io.FileWriter;
import java.util.Arrays;
import java.io.Writer;
import java.util.List;

char SEPARATOR = ',';

public void writeLine(FileWriter writer, String[] params, char separator)
{
    boolean firstParam = true;
    StringBuilder stringBuilder = new StringBuilder();
    String param = "";
    for (int i = 0; i < params.length; i++)
    {
        param = params[i];
        log.info(param);
        if (!firstParam)
        {
            stringBuilder.append(separator);
        }
        stringBuilder.append(param);
        firstParam = false;
    }
    stringBuilder.append("\n");
    log.info(stringBuilder.toString());
    writer.append(stringBuilder.toString());
}

String csvFile = "D:/jmeter/test1/result.csv"; // for example '/User/Downloads/blabla.csv'
//String[] params = {"${acctId}", "${tranId}"};
String[] params = {"${acctId}"};

FileWriter fileWriter = new FileWriter(csvFile, true);
writeLine(fileWriter, params, SEPARATOR);
fileWriter.flush();
fileWriter.close();
You can use "View Result Tree" Sampler or "Simple Data Writer" to save the response messages. Just click "Configure" and use save as XML and select "Save response data(XML)" with other required fields. Thought, it is not recommended for load test.
The recommended way of saving a variable into a CSV file is the Sample Variables property:
Add the following line to the user.properties file (it lives in the "bin" folder of your JMeter installation):
sample_variables=acctId
Restart JMeter to pick the property up
That's it. Now when you run your JMeter test in command-line non-GUI mode, like:
jmeter -n -t test.jmx -l result.jtl
You will see an extra column in the result.jtl file holding the value of the acctId variable for each sampler.
Also be aware that starting from JMeter 3.1 it is recommended to use Groovy for any form of scripting. You will be able to replace your code with something like:
new File('D:/jmeter/test1/result.csv') << vars.get('acctId') << System.getProperty('line.separator')
If you don't like the Groovy syntax, be aware that you can use the FileUtils.writeStringToFile() function.
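For example, a minimal JSR223 sketch (Groovy accepts this Java-style syntax; commons-io ships with JMeter, and the path below is just illustrative):

import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;

// Append one line per sampler run; true = append rather than overwrite.
FileUtils.writeStringToFile(new File("D:/jmeter/test1/result.csv"),
        vars.get("acctId") + System.getProperty("line.separator"),
        StandardCharsets.UTF_8, true);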
I have a problem copying binary files (which are stored as sequence files in Hadoop) to my local machine. The problem is that the binary file I downloaded from HDFS was not the original binary file I generated when running my map-reduce tasks. I Googled similar problems and I guess the issue is that when I copy the sequence files to my local machine, I get the header of the sequence file plus the original file.
My question is: Is there any way to avoid downloading the header but still preserve my original binary file?
There are two ways I can think of:
I can transform the binary file into some other format, like Text, so that I can avoid using SequenceFile. After I do copyToLocal, I transform it back into a binary file.
I still use the sequence file, but when I generate the binary file, I also generate some meta information about the corresponding sequence file (e.g. the length of the header and the original length of the file). After I do copyToLocal, I use the downloaded binary file (which contains the header, etc.) along with the meta information to recover my original binary file.
I don't know which one is feasible. Could anyone give me a solution? Could you also show me some sample code for the solution you suggest?
I highly appreciate your help.
I found a workaround for this question. Since downloading a sequence file gives you the header and other magic words in the binary file, the way I avoid this problem is to transform my original binary file into a Base64 string, store it as Text in HDFS, and, when downloading the encoded binary files, decode them back into my original binary file.
I know this takes extra time, but currently I haven't found any other solution to this problem. The hard part of directly removing the headers and other magic words from the sequence file is that Hadoop may insert sync markers in the middle of my binary file.
If anyone has a better solution to this problem, I'd be very happy to hear about it. :)
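For reference, a minimal sketch of that encode/decode round trip with java.util.Base64 (Java 8+; the file names are illustrative):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) throws Exception {
        // Encode the original binary file to Base64 text before storing it as Text in HDFS.
        byte[] original = Files.readAllBytes(Paths.get("original.bin"));
        String encoded = Base64.getEncoder().encodeToString(original);
        Files.write(Paths.get("encoded.txt"), encoded.getBytes(StandardCharsets.US_ASCII));

        // After copyToLocal, decode the text back into the original bytes.
        String text = new String(Files.readAllBytes(Paths.get("encoded.txt")), StandardCharsets.US_ASCII);
        Files.write(Paths.get("restored.bin"), Base64.getDecoder().decode(text));
    }
}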
Use MapReduce code to read the SequenceFile, with SequenceFileInputFormat as the input format for reading the sequence file in HDFS. This splits the file into key-value pairs, and the value contains only the binary file contents, which you can use to create your binary file.
Here is a code snippet that splits a sequence file made up of multiple images into individual binary files and writes them to the local file system.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CreateOrgFilesFromSeqFile {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        if (args.length != 2) {
            System.out.println("Incorrect No of args (" + args.length + "). Expected 2 args: <seqFileInputPath> <outputPath>");
            System.exit(-1);
        }

        Path seqFileInputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "CreateSequenceFile");

        job.setJarByClass(CreateOrgFilesFromSeqFile.class);
        job.setMapperClass(CreateOrgFileFromSeqFileMapper.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, seqFileInputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Delete the existing output file
        outputPath.getFileSystem(conf).delete(outputPath, true);

        System.exit(job.waitForCompletion(true) ? 0 : -1);
    }
}

class CreateOrgFileFromSeqFileMapper extends Mapper<Text, BytesWritable, NullWritable, Text> {

    @Override
    public void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {

        Path outputPath = FileOutputFormat.getOutputPath(context);
        FileSystem fs = outputPath.getFileSystem(context.getConfiguration());

        String[] filePathWords = key.toString().split("/");
        String fileName = filePathWords[filePathWords.length - 1];

        System.out.println("Writing " + outputPath.toString() + "/" + fileName + ", value length: " + value.getLength());

        try (FSDataOutputStream fdos = fs.create(new Path(outputPath.toString() + "/" + fileName))) {
            fdos.write(value.getBytes(), 0, value.getLength());
            fdos.flush();
        }

        context.write(NullWritable.get(), new Text(outputPath.toString() + "/" + fileName));
    }
}
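If a full MapReduce job is more than you need, the same extraction can be done client-side with a plain SequenceFile.Reader; a minimal sketch (untested, assuming Text keys and BytesWritable values as in the job above, writing into the current local directory):

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(
                conf, SequenceFile.Reader.file(new Path(args[0])))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                // Use the last path component of the key as the local file name.
                String keyStr = key.toString();
                String fileName = keyStr.substring(keyStr.lastIndexOf('/') + 1);
                try (FileOutputStream out = new FileOutputStream(fileName)) {
                    out.write(value.getBytes(), 0, value.getLength());
                }
            }
        }
    }
}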
I have a long text file that I want to read and extract some data from. Using JavaFX and FXML, I am using a FileChooser to load the file and get the file path.
My controller.java has the following:
private void handleButtonAction(ActionEvent event) throws IOException {
    FileChooser fileChooser = new FileChooser();
    FileChooser.ExtensionFilter extFilter = new FileChooser.ExtensionFilter("TXT files (*.txt)", "*.txt");
    fileChooser.getExtensionFilters().add(extFilter);
    File file = fileChooser.showOpenDialog(stage);
    System.out.println(file);
    stage = (Stage) button.getScene().getWindow();
}
Sample of the text file. Note that some of the file content is split across two lines; for example, -Ba \ 10.10.10.3 is part of the first logical line.
net ip-interface create 10.10.10.2 255.255.255.128 MGT-1 -Ba \
10.10.10.3
net ip-interface create 192.168.1.1 255.255.255.0 G-1 -Ba \
192.168.1.2
net route table create 10.10.10.5 255.255.255.255 10.10.10.1 -i \
MGT-1
net route table create 10.10.10.6 255.255.255.255 10.10.10.1 -i \
MGT-1
I am looking for a way to search this (file) and output the following:
MGT-1 ip-interface 10.10.10.2
MGT-1 Backup ip-interface 10.10.10.3
G-1 ip-interface 192.168.1.1
G-1 Backup Ip-interface 192.168.1.2
MGT-1 route 10.10.10.5 DFG 10.10.10.1
MGT-1 route 10.10.10.6 DFG 10.10.10.1
Of course you can read the input file as a stream of lines using BufferedReader.lines or Files.lines. The tricky part here is how to deal with the trailing "\". There are several possible solutions. You may write your own Reader which wraps an existing Reader and simply ignores a backslash followed by EOL. Alternatively, you can write a custom Iterator or Spliterator which takes the BufferedReader.lines stream as input and handles this case. I'd suggest using my StreamEx library, which already has a method for such tasks, called collapse:
StreamEx.ofLines(reader).collapse((a, b) -> a.endsWith("\\"),
(a, b) -> a.substring(0, a.length()-1).concat(b));
The first argument is a predicate applied to two adjacent lines, which should return true if the lines need to be merged. The second argument is a function which actually merges the two lines (we chop off the backslash via substring, then concatenate the next line).
Now you can just split each line on whitespace and convert it to one or two output lines according to your task. It is better to do this in a separate method. The whole code:
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.regex.Pattern;
import java.util.stream.Stream;

import javax.util.streamex.StreamEx;

public class ParseFile {

    static Stream<String> convertLine(String[] fields) {
        switch (fields[1]) {
        case "ip-interface":
            return Stream.of(fields[5] + " " + fields[1] + " " + fields[3],
                             fields[5] + " Backup " + fields[1] + " " + fields[7]);
        case "route":
            return Stream.of(fields[8] + " route " + fields[4] + " DFG " + fields[6]);
        default:
            throw new IllegalArgumentException("Unrecognized input: " +
                    String.join(" ", fields));
        }
    }

    static Stream<String> convert(Reader reader) {
        return StreamEx.ofLines(reader)
                .collapse((a, b) -> a.endsWith("\\"),
                          (a, b) -> a.substring(0, a.length() - 1).concat(b))
                .map(Pattern.compile("\\s+")::split)
                .flatMap(ParseFile::convertLine);
    }

    public static void main(String[] args) throws IOException {
        try (Reader r = new InputStreamReader(
                ParseFile.class.getResourceAsStream("test.txt"))) {
            convert(r).forEach(System.out::println);
        }
    }
}
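For completeness, the continuation merging can also be done without StreamEx, along the lines of the custom handling mentioned above; a minimal plain-JDK sketch:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MergeContinuations {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            StringBuilder merged = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.endsWith("\\")) {
                    // Drop the trailing backslash and keep accumulating.
                    merged.append(line, 0, line.length() - 1);
                } else {
                    merged.append(line);
                    System.out.println(merged); // one logical line
                    merged.setLength(0);
                }
            }
        }
    }
}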
My program is almost working, but the second array in my main isn't displaying anything, and I can't figure out why. Here is my code.
package myutilites;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Map;
import java.util.Scanner;
import java.util.StringTokenizer;

public class extraCredit
{
    public static void main(String[] args)
    {
        ArrayList<String> array = system("ls -l");
        for (String s : array)
            System.out.println(s);

        ArrayList<String> array1 = system("ls -l *.java");
        for (String a : array1)
            System.out.println(a);
    }

    public static ArrayList<String> system(String string)
    {
        ArrayList<String> array = new ArrayList<String>();
        ArrayList<String> infoArray = new ArrayList<String>();
        String s = string;
        StringTokenizer tok = new StringTokenizer(s, "\\,: ");
        while (tok.hasMoreTokens())
        {
            array.add(tok.nextToken());
        }
        for (String a : array)
            System.out.println(a);
        try
        {
            ProcessBuilder pb = new ProcessBuilder(array);
            Map<String, String> env = pb.environment();
            env.put("VAR1", "myValue");
            env.remove("OTHERVAR");
            env.put("VAR2", env.get("VAR1") + "suffix");
            pb.directory();
            Process p = pb.start();
            Scanner c = new Scanner(p.getInputStream());
            while (c.hasNext())
            {
                infoArray.add(c.nextLine());
            }
            c.close();
        } catch (IOException e) {}
        return infoArray;
    }
}
My output is below; the ls -l *.java call doesn't work.
ls
-l
total 0
drwxr-xr-x 8 brianhammons staff 272 Sep 10 09:44 bin
drwxr-xr-x 10 brianhammons staff 340 Sep 9 10:04 src
ls
-l
*.java
Your input string "ls -l *.java" will not return anything unless you are in a directory containing Java source files. As written, the program is looking for a file literally named "*.java", which cannot be found, because glob patterns like *.java are expanded by the shell, not by ProcessBuilder. You need to pass the full path into the system function.
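One way to get shell behavior such as glob expansion is to hand the whole command line to a shell explicitly; a minimal sketch (it assumes /bin/sh is available):

import java.io.IOException;

public class ShellGlob {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Glob patterns like *.java are expanded by the shell, not by
        // ProcessBuilder, so run the command through a shell.
        ProcessBuilder pb = new ProcessBuilder("/bin/sh", "-c", "ls -l *.java");
        pb.inheritIO();
        System.out.println("Exited with: " + pb.start().waitFor());
    }
}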
It depends on what you are trying to do. If you want to find every *.java file on your system, then use "locate *.java".
This will find every instance of a file or directory with "java" in the name, including source code and system files. Use "locate --help" to see a list of options for the locate command.
If you want to find just your source code, then you will need to loop through the directory structure looking for your source directories (normally src) and then extract the files.
If you just want to search one specific directory, then you must use the command "cd <directory>",
e.g. cd /home/michael/Java.
It would help to learn some basic Bash commands. One of the best sources is http://bash.cyberciti.biz/guide/Main_Page.
If you can be more specific about what you are trying to do, I would be happy to help.
I like using Java to do systems work (I know, it's a bit odd, but that's just me). However, in your case, I would highly recommend not using it for this task.
Java already has excellent access to file system details through the File, Path, and FileSystem interfaces. Unless you really need something truly exotic, you'll get better results through those interfaces than anything you can hand-parse and put together yourself.
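For instance, a minimal sketch that lists *.java files using NIO's built-in glob support (the directory is illustrative):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListJavaFiles {
    public static void main(String[] args) throws IOException {
        // Files.newDirectoryStream applies the glob itself, so no shell is needed.
        try (DirectoryStream<Path> stream =
                Files.newDirectoryStream(Paths.get("."), "*.java")) {
            for (Path p : stream) {
                System.out.println(p.getFileName());
            }
        }
    }
}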