Splitting a text file into multiple files by specific character sequence

I have a file with the following format.
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...
".I 1" marks the beginning of the chunk of text belonging to document ID 1, ".I 2" marks the beginning of the chunk belonging to document ID 2, and so on.
I need to read the text between ".I 1" and ".I 2" and save it as a separate file such as "DOC_ID_1.txt", then read the text between ".I 2" and ".I 3" and save it as "DOC_ID_2.txt", and so on. Assume that the number of ".I #" headers is not known in advance.
I have tried the following but cannot finish it. Any help will be appreciated.
String inputDocFile="C:\\Dropbox\\Data\\cran.all.1400";
try {
File inputFile = new File(inputDocFile);
FileReader fileReader = new FileReader(inputFile);
BufferedReader bufferedReader = new BufferedReader(fileReader);
String line=null;
String outputDocFileSeperatedByID="DOC_ID_";
//Pattern docHeaderPattern = Pattern.compile(".I ", Pattern.MULTILINE | Pattern.COMMENTS);
ArrayList<ArrayList<String>> result = new ArrayList<> ();
int docID =0;
try {
StringBuilder sb = new StringBuilder();
line = bufferedReader.readLine();
while (line != null) {
if (line.startsWith(".I"))
{
result.add(new ArrayList<String>());
result.get(docID).add(".I");
line = bufferedReader.readLine();
while(line != null && !line.startsWith(".I")){
line = bufferedReader.readLine();
}
++docID;
}
else line = bufferedReader.readLine();
}
} finally {
bufferedReader.close();
}
} catch (IOException ex) {
Logger.getLogger(ReadFile.class.getName()).log(Level.SEVERE, null, ex);
}

You want to find the lines which match ".I n".
The regex you need is : ^\.I \d$
^ indicates the beginning of the line. Hence, if there is any whitespace or text before the ".I", the line will not match the regex.
\. matches a literal dot. An unescaped . would match any character.
\d indicates any digit. For the sake of simplicity, I allow only one digit in this regex; use \d+ to accept IDs with more than one digit.
$ indicates the end of the line. Hence, if there are any characters after the digit, the line will not match the expression.
Now, you need to read the file line by line and keep a reference to the file in which you write the current line.
Reading a file line by line is much easier in Java 8 with Files.lines();
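// Needs java.io.IOException, java.nio.file.*, and java.util.stream.Stream
private String currentFile = "root.txt";
// The dot is escaped; \\d+ accepts IDs with more than one digit
public static final String REGEX = "^\\.I \\d+$";
public void foo() throws IOException {
    Path path = Paths.get("path/to/your/input/file.txt");
    // try-with-resources closes the underlying file when the stream is done
    try (Stream<String> lines = Files.lines(path)) {
        lines.forEach(line -> {
            if (line.matches(REGEX)) {
                // Extract the number after ".I " and update currentFile
                currentFile = "DOC_ID_" + line.substring(3) + ".txt";
                System.out.println("Current file is now : " + currentFile);
            } else {
                System.out.println("Writing this line to " + currentFile + " : " + line);
                // Files.write(...);
            }
        });
    }
}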
Note: to extract the digits I use a raw substring(), which I consider evil but which is easier to understand. A better way is a Pattern and a Matcher with this regex: "\\.I (\\d+)". It is the same idea as before, but the parentheses mark the part you want to capture. Then:
Pattern pattern = Pattern.compile("\\.I (\\d+)");
Matcher matcher = pattern.matcher(".I 3");
if (matcher.find()) {
    System.out.println(matcher.group(1)); // prints "3"
}
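Putting the pieces of this answer together, the following is a minimal sketch rather than a definitive implementation. The class name DocSplitter and the command-line arguments are hypothetical. It assumes the whole input fits in memory, that every header line is exactly ".I" followed by a space and a number, and that text before the first header can be dropped.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocSplitter {
    // Escaped dot, one or more digits; the ID is captured in group 1.
    private static final Pattern HEADER = Pattern.compile("^\\.I (\\d+)$");

    /** Groups lines under the most recent ".I n" header; keys are output file names. */
    static Map<String, List<String>> split(List<String> lines) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        List<String> current = null; // stays null until the first header is seen
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                current = new ArrayList<>();
                docs.put("DOC_ID_" + m.group(1) + ".txt", current);
            } else if (current != null) {
                current.add(line);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: args[0] is the input file, args[1] an existing output directory.
        Path input = Paths.get(args[0]);
        Path outDir = Paths.get(args[1]);
        for (Map.Entry<String, List<String>> doc : split(Files.readAllLines(input)).entrySet()) {
            Files.write(outDir.resolve(doc.getKey()), doc.getValue());
        }
    }
}
```

Keeping split() free of file I/O makes it easy to check on a small in-memory list before pointing it at the real file.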

Look up regular expressions; Java has built-in libraries for them.
https://docs.oracle.com/javase/tutorial/essential/regex/
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
These links give you a starting point. In effect, you match the header pattern against each line, keep a counter of the headers seen so far, and store everything between one match and the next. That text can then be written to a separate file using the Formatter class, documented here:
http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
public class Test {
    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        String inputFile = "C:\\logs\\test.txt";
        StringBuilder sb = new StringBuilder();
        int count = 1;
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader(new File(inputFile)))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(".I")) {
                    // A new header ends the previous document, so flush it
                    if (sb.length() != 0) {
                        writeDoc(count++, sb);
                    }
                    continue;
                }
                // readLine() strips the line terminator, so add it back
                sb.append(line).append(System.lineSeparator());
            }
            // The last document has no following ".I" line, so flush it here
            if (sb.length() != 0) {
                writeDoc(count, sb);
            }
        }
    }
    // Writes the buffered text to DOC_ID_<id>.txt and clears the buffer
    private static void writeDoc(int id, StringBuilder sb) throws IOException {
        try (PrintWriter writer = new PrintWriter(new File("C:\\logs\\DOC_ID_" + id + ".txt"), "UTF-8")) {
            writer.print(sb);
        }
        sb.setLength(0);
    }
}

Related

java string StringTokenizer doesn't recognize token after "//"?

I am writing code that should print only the comments in a Java file. It works when I have a comment like this:
// a comment
but when I have a comment like this:
// /* cdcdf
it does not print "/* cdcdf"; it only prints a blank line.
Does anyone know why this happens?
Here is my code:
package printC;
import java.io.*;
import java.util.StringTokenizer;
import java.lang.String ;
public class PrintComments {
public static void main(String[] args) {
try {
String line;
BufferedReader br = new BufferedReader(new FileReader(args[0]));
while ((line = br.readLine()) != null) {
if (line.contains("//") ) {
StringTokenizer st1 = new StringTokenizer(line, "//");
if(!(line.startsWith("//"))) {
st1.nextToken();
}
System.out.println(st1.nextToken());
}
}
}catch (Exception e) {
System.out.println(e);
}
}
}

You can simplify the code by just looking for the first position of the //. indexOf works fine for this. You don't need to tokenize, because you only want everything after a certain position; there is no need to split the line into multiple pieces.
If the // is found (that is, indexOf returns something other than -1, its value for "not found"), use substring to print only the characters starting at that position.
This minimal example should do what you want:
import java.io.*;
import java.util.StringTokenizer;
public class PrintComments {
public static void main(String[] args) throws IOException {
String line; // comment
BufferedReader br = new BufferedReader(new FileReader(args[0]));
while ((line = br.readLine()) != null) {
int commentStart = line.indexOf("//");
if (commentStart != -1) {
System.out.println(line.substring(commentStart));
}
} // /* that's it
}
}
If you don't want to print the //, just add 2 to commentStart.
Note that this primitive approach to parsing for comments is very brittle. If you run the program on its own source, it will happily report //"); as well, for the line containing the indexOf call. Any serious attempt to find comments needs to properly parse the source code.
Edit: If you want to look for other comments marked by /* and */ as well, do the same thing for the opening comment, then look for the closing comment at the end of the line. This will find a /* comment */ when all of the comment is on a single line. When it sees the opening /* it looks whether the line ends with a closing */ and if so, uses substring again to only pick the parts between the comment markers.
import java.io.*;
import java.util.StringTokenizer;
public class PrintComments {
public static void main(String[] args) throws IOException {
String line; // comment
BufferedReader br = new BufferedReader(new FileReader(args[0]));
while ((line = br.readLine()) != null) {
String comment = null;
int lineStart = line.indexOf("//");
int blockStart = line.indexOf("/*");
// Whichever marker comes first on the line decides the comment type,
// so "// /* text" is treated as a line comment
if (lineStart != -1 && (blockStart == -1 || lineStart < blockStart)) {
comment = line.substring(lineStart + 2);
} else if (blockStart != -1) {
comment = line.substring(blockStart + 2);
if (comment.endsWith("*/")) {
comment = comment.substring(0, comment.length() - 2);
}
}
if (comment != null) {
System.out.println(comment);
}
} // /* that's it
/* test */
}
}
To extend this to comments that span multiple lines, you need to remember whether you are inside a multi-line comment. While you are, keep printing each line and checking for the closing */.
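The multi-line case can be sketched with a single boolean that survives from one line to the next. This is an illustration under stated assumptions, not a real parser: the class and method names are hypothetical, only /* ... */ comments are handled, and markers inside string literals are still misread as comments.

```java
import java.util.ArrayList;
import java.util.List;

class BlockCommentScanner {
    /** Returns the text found inside block comments, one entry per comment fragment on a line. */
    static List<String> blockCommentText(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean inComment = false; // true while an opening marker is still unclosed
        for (String line : lines) {
            int pos = 0;
            while (pos < line.length()) {
                if (!inComment) {
                    int open = line.indexOf("/*", pos);
                    if (open == -1) {
                        break; // no more comments start on this line
                    }
                    inComment = true;
                    pos = open + 2;
                } else {
                    int close = line.indexOf("*/", pos);
                    if (close == -1) {
                        // The comment continues on the next line
                        out.add(line.substring(pos));
                        pos = line.length();
                    } else {
                        out.add(line.substring(pos, close));
                        inComment = false;
                        pos = close + 2;
                    }
                }
            }
        }
        return out;
    }
}
```

Because the flag is declared outside the per-line loop, a comment opened on one line is still considered open when the next line is scanned.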

StringTokenizer takes a set of delimiter characters, not a single string delimiter, so "//" makes it split on every '/' character. For the line "// /* cdcdf" the first token is therefore the single space between "//" and "/*", which prints as a blank line.
If you just want the rest of the line after the "//", you could use:
if(line.startsWith("//")) {
line = line.substring(2);
}

In addition to @jtahlborn's answer, you can check all of the tokens by iterating over them, e.g.:
...
StringTokenizer st1 = new StringTokenizer(line, "//");
while (st1.hasMoreTokens()){
System.out.println("token found:" + st1.nextToken());
}
...
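To make the behaviour concrete, here is a small self-contained sketch (the class and method names are hypothetical) that collects the tokens for the line from the question and contrasts them with a plain indexOf/substring approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    /** Collects every token StringTokenizer produces for the given delimiter characters. */
    static List<String> tokens(String line, String delimiters) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delimiters);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    /** Returns the text after the first "//", or null if the line has none. */
    static String afterLineComment(String line) {
        int start = line.indexOf("//");
        return start == -1 ? null : line.substring(start + 2);
    }

    public static void main(String[] args) {
        // Each '/' is a delimiter, so the tokens are " " and "* cdcdf"
        System.out.println(tokens("// /* cdcdf", "//"));
        // Searching for the two-character string keeps the rest of the line intact
        System.out.println(afterLineComment("// /* cdcdf"));
    }
}
```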

If you are reading line by line, the StringTokenizer doesn't do much in your code. Try changing the body of the if like this:
if (line.trim().startsWith("//")) { // true only if the line starts with //, i.e. a comment line
    // Do stuff with the line
    String withoutSlashes = line.trim().replace("//", " "); // removes every // in the line
    String cleanLine = line.trim().substring(2); // removes only the first //
}
Note: always use trim() to remove whitespace at the beginning and end of the string.
To split the line on // use:
line.split("//")
For a more general approach, check out:
Java - regular expression finding comments in code

how to split one text into multiple text files

I have the following Text:
1
(some text)
/
2
(some text)
/
.
.
/
8519
(some text)
and I want to split this text into several text files, where each file is named after the number that precedes the text (1.txt, 2.txt, and so on) and contains that text.
I tried this code
BufferedReader br = new BufferedReader(new FileReader("(Path)\\doc.txt"));
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
// sb.append(System.lineSeparator());
line = br.readLine();
}
String str = sb.toString();
String[] arrOfStr = str.split("/");
for (int i = 0; i < arrOfStr.length; i++) {
PrintWriter writer = new PrintWriter("(Path)" + arrOfStr[i].charAt(0) + ".txt", "UTF-8");
writer.println(arrOfStr[i].substring(1));
writer.close();
}
System.out.println("Done");
} finally {
br.close();
}
This code works for files 1-9. However, things go wrong for files 10-8519 because I take only the first character of each chunk (arrOfStr[i].charAt(0)). I know my solution is insufficient. Any suggestions?

In addition to my comment: since there is no space between the leading integer and the first word, taking the substring up to the first space doesn't work.
The question linked below has a few options that should help. The one using the regex \d+ is the simplest in my opinion, and is copied here.
Matcher matcher = Pattern.compile("\\d+").matcher(arrOfStr[i]);
matcher.find();
int yourNumber = Integer.valueOf(matcher.group());
Given a string find the first embedded occurrence of an integer
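As a hedged sketch of how the regex approach could replace charAt(0) in the question's loop, the helper below returns both the leading number and the rest of the chunk. The class and method names are hypothetical, and it assumes each chunk starts with the number, possibly after some whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingNumber {
    // Optional leading whitespace, then the digits captured in group 1
    private static final Pattern LEADING_DIGITS = Pattern.compile("^\\s*(\\d+)");

    /** Splits a chunk such as "8519(some text)" into the number and the remaining text. */
    static String[] splitNumberAndText(String chunk) {
        Matcher m = LEADING_DIGITS.matcher(chunk);
        if (!m.find()) {
            throw new IllegalArgumentException("chunk does not start with a number: " + chunk);
        }
        // m.end() is the index just past the last digit
        return new String[] { m.group(1), chunk.substring(m.end()) };
    }
}
```

In the question's loop, the first element would name the file and the second would be its content, so multi-digit numbers such as 8519 are handled the same way as single digits.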

As you mentioned, the problem is that you only take the first digit. You could walk through the leading characters until you find a non-digit character ( arrOfStr[i].charAt(j) < '0' || arrOfStr[i].charAt(j) > '9' ), but it should be easier to use a Scanner and an appropriate regexp.
int index = new Scanner(arrOfStr[i]).useDelimiter("\\D+").nextInt();
The delimiter is any run of non-digit characters.

Here is a quick solution for the given problem. You can test and do proper exception handling.
package practice;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
public class FileNioTest {
public static void main(String[] args) {
Path path = Paths.get("C:/Temp/readme.txt");
try {
List<String> contents = Files.readAllLines(path);
StringBuffer sb = new StringBuffer();
String folderName = "C:/Temp/";
String fileName = null;
String previousFileName = null;
// Read from the stream
for (String content : contents) {// for each line of content in contents
if (content.matches("-?\\d+")) { // check if it is a number (based on your requirement)
fileName = folderName + content + ".txt"; // create a file name with path
if (sb != null && sb.length() > 0) { // this means if content present to write in the file
writeToFile(previousFileName, sb); // write to file
sb.setLength(0); // clearing buffer
}
createFile(fileName); // create a new file if number found in the line
previousFileName = fileName; // store the name to write content in previous opened file.
} else if (!content.equals("/")) { // skip the "/" separator lines
sb.append(content).append(System.lineSeparator()); // keep storing the content, restoring the line break
}
System.out.println(content);// print the line
}
if (sb != null && sb.length() > 0) {
writeToFile(fileName, sb);
sb.setLength(0);
}
} catch (IOException ex) {
ex.printStackTrace();// handle exception here
}
}
private static void createFile (String fileName) {
Path newFilePath = Paths.get(fileName);
if (!Files.exists(newFilePath)) {
try {
Files.createFile(newFilePath);
} catch (IOException e) {
System.err.println(e);
}
}
}
private static void writeToFile (String fileName, StringBuffer sb) {
try {
Files.write(Paths.get(fileName), sb.toString().getBytes(), StandardOpenOption.APPEND);
}catch (IOException e) {
System.err.println(e);
}
}
}

How to apply regex to entire file, not just line after line?

I want to apply my regular expression to the whole text file at once, not line by line.
Currently it matches only when the entire match is on one line. If the text to match continues on the next line, it doesn't match at all.
class Parser {
public static void main(String[] args) throws IOException {
Pattern patt = Pattern.compile("(include|"
+ "integrate|"
+ "driven based on|"
+ "facilitate through|"
+ "contain|"
+ "using|"
+ "equipped"
+ "integrate|"
+ "implement|"
+ "utilized to facilitate|"
+ "comprise){1}"
+ "[\\s\\w\\,\\(\\)\\;\\:]*\\."); //Regex
BufferedReader r = new BufferedReader(new FileReader("E:/test/test.txt")); // read the file
String line;
PrintWriter pWriter = null;
while ((line = r.readLine()) != null) {
Matcher matcher = patt.matcher(line);
while (matcher.find()) {
try{
pWriter = new PrintWriter(new BufferedWriter(new FileWriter("E:/test/test1.txt", true)));//append any given input
pWriter.println(matcher.group()); //write the result of matcher to the new file
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
if (pWriter != null){
pWriter.flush();
pWriter.close();
}
}
System.out.println(matcher.group());
}
}
}
}

Change the while ((line = r.readLine()) != null) loop to this:
StringBuilder file = new StringBuilder(); // All of the lines in the file, joined together
while ((line = r.readLine()) != null) {
    file.append(line).append('\n'); // readLine() strips the line break, so put it back
}
Matcher matcher = patt.matcher(file);
while (matcher.find()) {
    /* Your output code goes here. */
}
This happens because you're matching each line individually. For what you want, you need to apply the regex to the contents of the whole file at once.
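A minimal sketch of the whole-file approach follows. The class name, the shortened keyword list, and the command-line argument are assumptions for illustration, and it assumes the file is UTF-8 and small enough to hold in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeTextMatcher {
    /** Returns every match of the pattern in the text; matches may span line breaks. */
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: args[0] is the file to scan
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // Shortened keyword list; \s inside the class lets a match cross line breaks
        Pattern p = Pattern.compile("(include|using)[\\s\\w,();:]*\\.");
        for (String match : findAll(p, text)) {
            System.out.println(match);
        }
    }
}
```

Reading the bytes in one call keeps the original line breaks, so the \s in the character class can match them.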

Currently the matcher is applied per line; it needs to be applied to the whole file to work as intended.
Regex quantifiers are greedy: each match runs on until it reaches a character outside the character class, such as a . in your text:
...
+ "comprise){1}"
+ "[\\s\\w\\,\\(\\)\\;\\:]*\\."); //Regex
On the last line you match any whitespace and word characters, so pretty much anything but a dot. Also, the {1} and most of the backslashes are superfluous (the characters are inside []):
...
+ "comprise)"
+ "[\\s\\w,();:]*\\."); //Regex
If you don't care about the newline characters just remove them first and it should work (I see no way around it if you have something like "com\nprise" and want to match that):
s = s.replaceAll("\\n+", "");

Java String Matching in a Sorted File and grouping similar data

I have a sorted file and I need to do the following pattern match. I read a row and compare it with the rows just after it; if a row matches, I append the string I used for matching to that row after a comma, and move on to the next row. I am new to Java and overwhelmed by the options, from OpenCSV to BufferedReader. I intend to iterate through the file until it reaches the end. The data may contain blanks and a date in quotes. The file size would be around 100 MB.
My file has data like
ABCD
ABCD123
ABCD456, 123
XYZ
XYZ890
XYZ123, 890
and output is expected as
ABCD, ABCD
ABCD123, ABCD
ABCD456, 123, ABCD
XYZ, XYZ
XYZ890, XYZ
XYZ123, 890, XYZ
Not sure about the best method. Can you please help me.

To open a file, you can use File and FileReader classes:
File csvFile = new File("file.csv");
FileReader fileReader = null;
try {
fileReader = new FileReader(csvFile);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
You can read the file line by line using Scanner:
Scanner reader = new Scanner(fileReader);
while (reader.hasNextLine()) {
    String line = reader.nextLine();
    parseLine(line);
}
To parse each line, use a regex with the Pattern and Matcher classes:
private void parseLine(String line) {
Matcher matcher = Pattern.compile("(ABCD)").matcher(line);
if(matcher.find()){
System.out.println("find: " + matcher.group());
}
}
To find the next match in the same row, call matcher.find() again. If another match is found, it returns true and you can get the result with matcher.group().

Read line by line and use regex to replace it as per your need using String.replaceAll()
^([A-Z]+)([0-9]*)(, [0-9]+)?$
Replacement : $1$2$3, $1
Here is Online demo
Read more about Java Pattern
Sample code:
String regex = "^([A-Z]+)([0-9]*)(, [0-9]+)?$";
String replacement = "$1$2$3, $1";
String newLine = line.replaceAll(regex,replacement);
For better performance, read 100 or more lines at a time into a buffer and call String#replaceAll() once per batch instead of once per line.
sample code:
String regex = "([A-Z]+)([0-9]*)(, [0-9]+)?(\r?\n|$)";
String replacement = "$1$2$3, $1$4";
StringBuilder builder = new StringBuilder();
int counter = 0;
String line = null;
try (BufferedReader reader = new BufferedReader(new FileReader("abc.csv"))) {
while ((line = reader.readLine()) != null) {
builder.append(line).append(System.lineSeparator());
if (++counter % 100 == 0) { // flush after every 100 lines
String newLine = builder.toString().replaceAll(regex, replacement);
System.out.print(newLine);
builder.setLength(0); // reset the buffer
}
}
}
if (builder.length() > 0) {
String newLine = builder.toString().replaceAll(regex, replacement);
System.out.print(newLine);
}
Read more about Java 7 - The try-with-resources Statement
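The per-line replacement can be checked against the sample data from the question. This is a small sketch; the class and method names are hypothetical, and lines that do not fit the pattern are passed through unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AppendPrefix {
    // Group 1: leading letters, group 2: optional digits, group 3: optional ", digits"
    private static final Pattern ROW = Pattern.compile("^([A-Z]+)([0-9]*)(, [0-9]+)?$");
    private static final String REPLACEMENT = "$1$2$3, $1";

    /** Appends the leading letter prefix of each row to the end of that row. */
    static List<String> transform(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            // A group that did not take part in the match contributes nothing
            out.add(ROW.matcher(line).replaceAll(REPLACEMENT));
        }
        return out;
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every line, which String.replaceAll() does on each call.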

in java, how to print entire line in the file when string match found

I have a text file called "Sample.text" that contains multiple lines. I need to search this file for a particular string. If the string is found, I need to print the entire line that contains it. The search string is in the middle of the line. I am also using a StringBuffer to collect the text after reading it from the file. The file is very large, so I don't want to iterate line by line. How can I do this?

You could do it with FileUtils from Apache Commons IO
Small sample:
StringBuffer myStringBuffer = new StringBuffer();
List<String> lines = FileUtils.readLines(new File("/tmp/myFile.txt"), "UTF-8");
for (String line : lines) {
    if (line.contains("something")) {
        myStringBuffer.append(line);
    }
}
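If adding a library is not an option, a similar result can be sketched with the JDK alone. The class name and command-line arguments below are hypothetical. Files.lines reads the file lazily, so only the matching lines are kept in memory, although each line is still visited once.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LineGrep {
    /** Returns every line that contains the search string anywhere in it. */
    static List<String> matchingLines(Stream<String> lines, String needle) {
        return lines.filter(line -> line.contains(needle)).collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: args[0] is the file, args[1] the string to search for
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            for (String line : matchingLines(lines, args[1])) {
                System.out.println(line);
            }
        }
    }
}
```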

We can also use a regex for string or pattern matching from a file.
Sample code:
import java.util.regex.*;
import java.io.*;
/**
* Print all the strings that match a given pattern from a file.
*/
public class ReaderIter {
public static void main(String[] args) throws IOException {
// The RE pattern
Pattern patt = Pattern.compile("[A-Za-z][a-z]+");
// A FileReader (see the I/O chapter)
BufferedReader r = new BufferedReader(new FileReader("file.txt"));
// For each line of input, try matching in it.
String line;
while ((line = r.readLine()) != null) {
// For each match in the line, extract and print it.
Matcher m = patt.matcher(line);
while (m.find()) {
// Simplest method:
// System.out.println(m.group(0));
// Get the starting position of the text
int start = m.start(0);
// Get ending position
int end = m.end(0);
// Print whatever matched.
// Use CharacterIterator.substring(offset, end);
System.out.println(line.substring(start, end));
}
}
}
}
