How to read a big text file and work with it in Java

I have a large text file that I want to read. When I read it without doing anything else (such as adding text from the file to a List), it takes at most one minute. But when I add the text to an ArrayList so that I can process it afterwards, it is too slow. Do you know how I can read this data and use it?
This is my code:
public class ReaderTEst {
public static void main(String[] args) throws IOException {
List<String> graphList = new ArrayList<>();
List<String> edgeList = new ArrayList<>();
FileInputStream inputStream = null;
Scanner sc = null;
try {
inputStream = new FileInputStream("myText.txt");
sc = new Scanner(inputStream, "UTF-8");
while (sc.hasNextLine()) {
String line = sc.nextLine();
line = line.replace("\uFEFF", "");//i use UTF-8 file so I need delete unneeded character
if (Character.isWhitespace(line.charAt(0))) {
edgeList.add(line.trim());
} else {
graphList.add(line);
}
}
if (sc.ioException() != null) {
throw sc.ioException();
}
} finally {
if (inputStream != null) {
inputStream.close();
}
if (sc != null) {
sc.close();
}
}
}
}
It takes too much time. Do you know how it could be faster? My txt file is 600 MB.
When I change it to:
List<Integer> graphList = new ArrayList<>(1);
int i = 0;
while (sc.hasNextLine()) {
String line = sc.nextLine();
line = line.replace("\uFEFF", "");//i use UTF-8 file so I need delete unneeded character
graphList.add(i++);
}
it works, but when I add the strings instead, it takes too long.

You should use BufferedReader.readLine(). You can read millions of lines per second with that. Scanner is overkill for what you're doing.
But \uFEFF is not text. Is this really a text file? If that character is a byte order mark (BOM), it will appear only at the beginning of the first line, so there is no need to scan for it in every line.
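The advice above can be sketched roughly as follows. This is a minimal illustration rather than a tuned implementation: the class name GraphFileReader and the Result holder are hypothetical, and it assumes that lines starting with whitespace are edges, as in the question. The BOM is checked on the first line only, and blank lines are skipped so that charAt(0) cannot fail.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original posts.
final class GraphFileReader {

    /** Holds the two lists that the question builds. */
    static final class Result {
        final List<String> graphList = new ArrayList<>();
        final List<String> edgeList = new ArrayList<>();
    }

    static Result read(Reader source) throws IOException {
        Result result = new Result();
        try (BufferedReader reader = new BufferedReader(source)) {
            boolean firstLine = true;
            String line;
            while ((line = reader.readLine()) != null) {
                if (firstLine) {
                    firstLine = false;
                    // A BOM can only appear at the very start of the file.
                    if (line.startsWith("\uFEFF")) {
                        line = line.substring(1);
                    }
                }
                if (line.isEmpty()) {
                    continue; // charAt(0) would throw on an empty line
                }
                if (Character.isWhitespace(line.charAt(0))) {
                    result.edgeList.add(line.trim());
                } else {
                    result.graphList.add(line);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Result r = read(new StringReader("\uFEFFgraph1\n  edge a\n  edge b\ngraph2\n"));
        System.out.println(r.graphList); // prints [graph1, graph2]
        System.out.println(r.edgeList);  // prints [edge a, edge b]
    }
}
```

For a real file, the Reader could come from Files.newBufferedReader(Paths.get("myText.txt"), StandardCharsets.UTF_8) instead of the StringReader used here for demonstration.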

Your main issues are the following:
List<String> graphList = new ArrayList<>();
List<String> edgeList = new ArrayList<>();
You should initialize each List with an initial capacity so that the JVM does not need to automatically expand the backing array.
line = line.replace("\uFEFF", "");
This will also slow down your program. How often is \uFEFF in each line? I would check if the line contains \uFEFF before attempting to replace it.
Other than that, there's not much else to optimize; maybe you can utilize a FileChannel to read the file, but that's about it.

First of all, I advise using the LinkedList implementation of List because of how the two are built: ArrayList is backed by an array, while LinkedList consists of nodes. When an ArrayList reaches its capacity, it creates a new, bigger array and copies the old one into it. Oracle has good documentation on this; I recommend reading the entries for LinkedList and ArrayList.

Related

Putting a text file into an ArrayList, but if word exist it skips it

I'm struggling a bit here. I'm trying to add each word from a text file to an ArrayList, and every time the reader comes across a word it has already seen, it should skip it. (Makes sense?)
I don't even know where to start. I kind of know that I need one loop that adds the text file to the ArrayList and one that checks whether the word is already in the list. Any ideas?
PS: Just started with Java
This is what I've done so far; I don't even know if I'm on the right path.
public String findWord(){
int text = 0;
int i = 0;
while sc.hasNextLine()){
wordArray[i] = sc.nextLine();
}
if wordArray[i].contains() {
}
i++;
}
A List (an ArrayList or otherwise) is not the best data structure to use; a Set is better. In pseudo code:
define a Set
for each word
if adding to the set returns false, skip it
else do whatever do want to do with the (first time encountered) word
The add() method of Set returns true if the set changed as a result of the call, which only happens if the word isn't already in the set, because sets disallow duplicates.
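The pseudocode above could look roughly like this in Java. It is only a sketch: the class name UniqueWords is made up for illustration, and a LinkedHashSet is chosen (an assumption, not something the answer prescribes) so that the words keep their first-seen order.

```java
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper, not from the original posts.
final class UniqueWords {

    static Set<String> collect(Scanner scanner) {
        Set<String> seen = new LinkedHashSet<>();
        while (scanner.hasNext()) {
            String word = scanner.next();
            if (!seen.add(word)) {
                continue; // add() returned false: the word was already in the set
            }
            // First time this word is encountered: do any extra work here.
        }
        return seen;
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner("Hi how are you I am Hardi Who are you");
        System.out.println(collect(sc)); // prints [Hi, how, are, you, I, am, Hardi, Who]
    }
}
```

To read from a file instead, the Scanner could be constructed with new Scanner(new File(...)), as in the other answers.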
I once made a similar program; it read through a text file and counted how many times each word came up.
I'd start by importing Scanner and the file classes (these imports need to be at the top of the Java class):
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.File;
import java.io.PrintStream;
import java.util.Scanner;
Then you can create the File and a Scanner that reads from it; make sure to adjust the path to the file accordingly. The new PrintStream is not necessary, but when dealing with a large amount of data I don't like to flood the console.
public static void main(String[] args) throws FileNotFoundException {
File file=new File("E:/Youtube analytics/input/input.txt");
Scanner scanner = new Scanner(file); //will read from the file above
PrintStream out = new PrintStream(new FileOutputStream("E:/Youtube analytics/output/output.txt"));
System.setOut(out);
}
After this you can use scanner.next() to get the next word, so you would write something like this:
String[] array = new String[MaxAmountOfWords]; // holds each distinct word once
int numberOfWords = 0;
String currentWord = "";
while (scanner.hasNext()) {
    currentWord = scanner.next();
    // isNotInArray is a helper you still need to write: it checks array[0..numberOfWords)
    if (isNotInArray(currentWord)) {
        array[numberOfWords] = currentWord;
        numberOfWords++; // count only the words that were actually stored
    }
}
If you don't understand any of this or need further guidance to progress, let me know. It is hard to help if we don't know exactly where you are at.
You can try this:
public List<String> getAllWords(String filePath) throws IOException {
    String line;
    List<String> allWords = new ArrayList<String>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader(new File(filePath)))) {
        // read each line of the file
        while ((line = reader.readLine()) != null) {
            // split on runs of non-word characters to get each word in the line
            for (String word : line.split("\\W+")) {
                // skip empty tokens and words that are already in the list
                if (!word.isEmpty() && !allWords.contains(word)) {
                    allWords.add(word);
                }
            }
        }
    }
    return allWords;
}
Best solution is to use a Set. But if you still want to use a List, here goes:
Suppose the file has the following data:
Hi how are you
I am Hardi
Who are you
Code will be:
List<String> list = new ArrayList<>();
// Get the file.
FileInputStream fis = new FileInputStream("C:/Users/hdinesh/Desktop/samples.txt");
//Construct BufferedReader from InputStreamReader
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
String line = null;
// Loop through each line in the file
while ((line = br.readLine()) != null) {
// Regex for finding just the words
String[] strArray = line.split("[ ]");
for (int i = 0; i< strArray.length; i++) {
if (!list.contains(strArray[i])) {
list.add(strArray[i]);
}
}
}
br.close();
System.out.println(list.toString());
If your text file has sentences with special characters, you will have to write a regex for that.

Parsing a large text file into chunks in Java

I would like to receive some suggestions regarding a little problem I am going to solve in Java.
I have a file consisting in this format:
#
some text
some text
some text
#
some text
some text
some text
#
some text
some text
some text
...and so on.
I need to read the next chunk of this text file, create an InputStream object consisting of that chunk, and pass the InputStream to a parser. I have to repeat these operations for every chunk in the text file. Each chunk lies between the lines starting with #. The problem is to parse each section between the # tags using a parser that reads each chunk from an InputStream.
The text file could be big, so I would like to obtain good performance.
How could I solve this problem?
I have thought about doing something like this:
FileReader fileReader = new FileReader(file);
BufferedReader bufferedReader = new BufferedReader(fileReader);
Scanner scanner = new Scanner(bufferedReader);
scanner.useDelimiter("#");
List<ParsedChunk> parsedChunks = new ArrayList<ParsedChunk>();
ChunkParser parser = new ChunkParser();
while(scanner.hasNext())
{
String text = scanner.next();
InputStream inputStream = new ByteArrayInputStream(text.getBytes("UTF-8"));
ParsedChunk parsedChunk = parser.parse(inputStream);
parsedChunks.add(parsedChunk);
inputStream.close();
}
scanner.close();
but I am not sure if it would be a good way to do it.
Thank you.
If I have understood correctly, this is what you are trying to achieve. FYI, you will need Java 7 to run the code below.
public static void main(String[] args) throws IOException {
List<String> allLines = Files.readAllLines(new File("d:/input.txt").toPath(), Charset.defaultCharset());
List<List<String>> chunks = getChunks(allLines);
//Now you have all the chunks and you can process them
}
private static List<List<String>> getChunks(List<String> allLines) {
    List<List<String>> result = new ArrayList<List<String>>();
    int fromIndex = 0; // index of the first line of the current chunk
    for (int i = 0; i < allLines.size(); i++) {
        if (allLines.get(i).startsWith("#")) { // delimiter line: close the current chunk
            if (i > fromIndex) {
                result.add(allLines.subList(fromIndex, i));
            }
            fromIndex = i + 1;
        }
    }
    if (fromIndex < allLines.size()) { // the last chunk has no trailing delimiter
        result.add(allLines.subList(fromIndex, allLines.size()));
    }
    return result;
}
I didn't quite get the question, but you could try storing all the characters in a char array and looping over it with a conditional statement that breaks the string every time it encounters a '#'.
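For a file too large to load at once, one possible variation is to stream it line by line and hand each chunk to the parser as soon as its closing # line is seen. The sketch below assumes that delimiter lines start with # and that chunks are UTF-8; ChunkSplitter is a hypothetical name, and the Consumer stands in for the question's ChunkParser, whose API is not shown in the post.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper, not from the original posts.
final class ChunkSplitter {

    /** Calls chunkHandler once per non-empty chunk and returns the number of chunks. */
    static int forEachChunk(Reader source, Consumer<InputStream> chunkHandler) throws IOException {
        int chunks = 0;
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("#")) {
                    chunks += flush(current, chunkHandler); // delimiter closes the chunk
                } else {
                    current.append(line).append('\n');
                }
            }
            chunks += flush(current, chunkHandler); // last chunk has no closing delimiter
        }
        return chunks;
    }

    private static int flush(StringBuilder current, Consumer<InputStream> handler) {
        if (current.length() == 0) {
            return 0; // nothing collected, e.g. at the very first delimiter
        }
        byte[] bytes = current.toString().getBytes(StandardCharsets.UTF_8);
        current.setLength(0);
        handler.accept(new ByteArrayInputStream(bytes));
        return 1;
    }

    public static void main(String[] args) throws IOException {
        String sample = "#\nsome text\nsome text\n#\nsome text\n";
        int n = forEachChunk(new StringReader(sample), in -> System.out.println("got a chunk"));
        System.out.println(n + " chunks"); // prints 2 chunks
    }
}
```

Only one chunk is held in memory at a time, so memory use depends on the largest chunk rather than on the file size.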

BufferedReader: read multiple lines into a single string

I'm reading numbers from a txt file using BufferedReader for analysis. The way I'm going about this now is reading a line using .readLine and splitting this string into an array of strings using .split:
public InputFile () {
fileIn = null;
//stuff here
fileIn = new FileReader((filename + ".txt"));
buffIn = new BufferedReader(fileIn);
return;
//stuff here
}
public String ReadBigStringIn() {
String line = null;
try { line = buffIn.readLine(); }
catch(IOException e){};
return line;
}
public ProcessMain() {
initComponents();
String[] stringArray;
String line;
try {
InputFile stringIn = new InputFile();
line = stringIn.ReadBigStringIn();
stringArray = line.split("[^0-9.+Ee-]+");
// analysis etc.
}
}
This works fine, but what if the txt file has multiple lines of text? Is there a way to output a single long string, or perhaps another way of doing it? Maybe use while(buffIn.readline != null) {}? Not sure how to implement this.
Ideas appreciated,
thanks.
You are right, a loop would be needed here.
The usual idiom (using only plain Java) is something like this:
public String ReadBigStringIn(BufferedReader buffIn) throws IOException {
StringBuilder everything = new StringBuilder();
String line;
while( (line = buffIn.readLine()) != null) {
everything.append(line);
}
return everything.toString();
}
This removes the line breaks - if you want to retain them, don't use the readLine() method, but simply read into a char[] instead (and append this to your StringBuilder).
Please note that this loop will run until the stream ends (and will block if it doesn't end), so if you need a different condition to finish the loop, implement it in there.
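If the line breaks should be kept, the char[] variant mentioned above might look like the following sketch. The class name ReadAll and the 8192-character buffer size are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not from the original posts.
final class ReadAll {

    static String readFully(Reader in) throws IOException {
        StringBuilder everything = new StringBuilder();
        char[] buffer = new char[8192];
        int n;
        // read() returns the number of chars read, or -1 at end of stream
        while ((n = in.read(buffer)) != -1) {
            everything.append(buffer, 0, n);
        }
        return everything.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = readFully(new StringReader("1.5 2.5\n3.5\n"));
        System.out.println(text.split("\\s+").length); // prints 3
    }
}
```

Because nothing is stripped, the returned string is character-for-character what the reader delivered, line terminators included.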
I would strongly advise using a library here, but since Java 8 you can also do this using streams.
try (InputStreamReader in = new InputStreamReader(System.in);
BufferedReader buffer = new BufferedReader(in)) {
final String fileAsText = buffer.lines().collect(Collectors.joining());
System.out.println(fileAsText);
} catch (Exception e) {
e.printStackTrace();
}
Note that this is also pretty efficient, as joining uses a StringBuilder internally.
If you just want to read the entirety of a file into a string, I suggest you use Guava's Files class:
String text = Files.toString(new File("filename.txt"), Charsets.UTF_8);
Of course, that's assuming you want to maintain the linebreaks. If you want to remove the linebreaks, you could either load it that way and then use String.replace, or you could use Guava again:
List<String> lines = Files.readLines(new File("filename.txt"), Charsets.UTF_8);
String joined = Joiner.on("").join(lines);
Sounds like you want Apache Commons IO's FileUtils:
String text = FileUtils.readFileToString(new File(filename + ".txt"), StandardCharsets.UTF_8); // assuming the file is UTF-8
String[] stringArray = text.split("[^0-9.+Ee-]+");
If you create a StringBuilder, then you can append every line to it, and return the String using toString() at the end.
You can replace your ReadBigStringIn() with
public String ReadBigStringIn() {
StringBuilder b = new StringBuilder();
try {
String line = buffIn.readLine();
while (line != null) {
b.append(line);
line = buffIn.readLine();
}
}
catch(IOException e){};
return b.toString();
}
You have a file containing doubles. Looks like you have more than one number per line, and may have multiple lines.
Simplest thing to do is read lines in a while loop.
You could return null from your ReadBigStringIn method when last line is reached and terminate your loop there.
But more normal would be to create and use the reader in one method. Perhaps you could change to a method which reads the file and returns an array or list of doubles.
BTW, could you simply split your strings by whitespace?
Reading a whole file into a single String may suit your particular case, but be aware that it could cause a memory explosion if your file is very large. A streaming approach is generally safer for such I/O.
This creates one long string in which the lines are separated by a single space (" "):
public String ReadBigStringIn() {
    StringBuffer line = new StringBuffer();
    try {
        while (buffIn.ready()) {
            line.append(" " + buffIn.readLine());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return line.toString();
}

What is an efficient way to process large text files?

I have two files:
1. 1,400,000 lines (records), 14 MB
2. 16,000,000 lines (records), 170 MB
I want to find out whether each record (line) in file 1 is also in file 2.
I developed a Java app that does the following: it reads file 1 line by line and passes each line to a method that loops over file 2.
Here is my code:
public boolean hasIDin(String bioid) throws Exception {
BufferedReader br = new BufferedReader(new FileReader("C://AllIDs.txt"));
long bid = Long.parseLong(bioid);
String thisLine;
while((thisLine = br.readLine( )) != null)
{
if (Long.parseLong(thisLine) == bid)
return true;
}
return false;
}
public void getMBD() throws Exception{
BufferedReader br = new BufferedReader(new FileReader("C://DIDs.txt"));
OutputStream os = new FileOutputStream("C://MBD.txt");
PrintWriter pr = new PrintWriter(os);
String thisLine;
int count=1;
while ((thisLine = br.readLine( )) != null){
String bioid = thisLine;
System.out.println(count);
if(! hasIDin(bioid))
pr.println(bioid);
count++;
}
pr.close();
}
When I run it, it looks like it will take more than 1944 hours to complete, since processing each line takes 5 seconds. That is about three months!
Are there any ideas for getting this done in much less time?
Thanks in advance.
Why don't you:
read all the lines of file 1 (the smaller file) into a set. A Set is fine, but TLongHashSet (from the Trove library) would be more efficient.
for each line in file 2, check whether it is in the set.
Here is a tuned implementation which prints the following and uses < 64 MB.
Generating 1400000 ids to /tmp/DID.txt
Generating 16000000 ids to /tmp/AllIDs.txt
Reading ids in /tmp/DID.txt
Reading ids in /tmp/AllIDs.txt
Took 8794 ms to find 294330 valid ids
Code
public static void main(String... args) throws IOException {
generateFile("/tmp/DID.txt", 1400000);
generateFile("/tmp/AllIDs.txt", 16000000);
long start = System.currentTimeMillis();
TLongHashSet did = readLongs("/tmp/DID.txt");
TLongHashSet validIDS = readLongsUnion("/tmp/AllIDs.txt",did);
long time = System.currentTimeMillis() - start;
System.out.println("Took "+ time+" ms to find "+ validIDS.size()+" valid ids");
}
private static TLongHashSet readLongs(String filename) throws IOException {
System.out.println("Reading ids in "+filename);
BufferedReader br = new BufferedReader(new FileReader(filename), 128*1024);
TLongHashSet ids = new TLongHashSet();
for(String line; (line = br.readLine())!=null;)
ids.add(Long.parseLong(line));
br.close();
return ids;
}
private static TLongHashSet readLongsUnion(String filename, TLongHashSet validSet) throws IOException {
System.out.println("Reading ids in "+filename);
BufferedReader br = new BufferedReader(new FileReader(filename), 128*1024);
TLongHashSet ids = new TLongHashSet();
for(String line; (line = br.readLine())!=null;) {
long val = Long.parseLong(line);
if (validSet.contains(val))
ids.add(val);
}
br.close();
return ids;
}
private static void generateFile(String filename, int number) throws IOException {
System.out.println("Generating "+number+" ids to "+filename);
PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(filename), 128*1024));
Random rand = new Random();
for(int i=0;i<number;i++)
pw.println(rand.nextInt(1<<26));
pw.close();
}
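If adding Trove is not an option, the same idea can be sketched with only java.util. This is a swapped-in variant and not the tuned code above: it uses a boxed Set of Long, so it will need noticeably more memory than TLongHashSet. It also answers the question's original goal directly by returning the IDs of the small file that never appear in the large one. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not from the original posts.
final class MissingIds {

    /** Returns the ids from smallFile that do not occur anywhere in largeFile. */
    static Set<Long> findMissing(Reader smallFile, Reader largeFile) throws IOException {
        Set<Long> remaining = new LinkedHashSet<>();
        // Pass 1: load every id of the small file into memory.
        try (BufferedReader br = new BufferedReader(smallFile)) {
            for (String line; (line = br.readLine()) != null; ) {
                String id = line.trim();
                if (!id.isEmpty()) {
                    remaining.add(Long.parseLong(id));
                }
            }
        }
        // Pass 2: stream the large file once and cross off every id it contains.
        try (BufferedReader br = new BufferedReader(largeFile)) {
            for (String line; !remaining.isEmpty() && (line = br.readLine()) != null; ) {
                String id = line.trim();
                if (!id.isEmpty()) {
                    remaining.remove(Long.parseLong(id));
                }
            }
        }
        return remaining; // whatever is left was never seen in the large file
    }

    public static void main(String[] args) throws IOException {
        Set<Long> missing = findMissing(new StringReader("1\n2\n3\n"), new StringReader("3\n5\n1\n"));
        System.out.println(missing); // prints [2]
    }
}
```

Each file is read exactly once, so the work grows with the combined size of the two files instead of their product.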
170 MB + 14 MB are not such huge files.
My suggestion is to load the smaller file into a java.util.Map, parse the bigger file line by line (record by record), and check whether the current line is present in this Map.
P.S. The question looks trivial in RDBMS terms; maybe it's worth using one?
You can't do an O(N^2) when each iteration is so long, that's completely unacceptable.
If you have enough RAM, you simply parse file 1, create a map of all numbers, then parse file 2 and check your map.
If you don't have enough RAM, parse file 1, create a map and store it to a file, then parse file 2 and read the map. The key is to make the map as easy to parse as possible - make it a binary format, maybe with a binary tree or something where you can quickly skip around and search. (EDIT: I have to add Michael Borgwardt's Grace Hash Join link, which shows an even better way: http://en.wikipedia.org/wiki/Hash_join#Grace_hash_join)
If there is a limit to the size of your files, option 1 is easier to implement - unless you're dealing with huuuuuuuge files (I'm talking lots of GB), you definitely want to do that.
Usually, memory-mapping is the most efficient way to read large files. You'll need to use java.nio.MappedByteBuffer and java.io.RandomAccessFile.
But your search algorithm is the real problem. Building some sort of index or hash table is what you need.

Java: How to read a text file

I want to read a text file containing space separated values. Values are integers.
How can I read it and put it in an array list?
Here is an example of contents of the text file:
1 62 4 55 5 6 77
I want to have it in an arraylist as [1, 62, 4, 55, 5, 6, 77]. How can I do it in Java?
You can use Files#readAllLines() to get all lines of a text file into a List<String>.
for (String line : Files.readAllLines(Paths.get("/path/to/file.txt"))) {
// ...
}
Tutorial: Basic I/O > File I/O > Reading, Writing and Creating text files
You can use String#split() to split a String in parts based on a regular expression.
for (String part : line.split("\\s+")) {
// ...
}
Tutorial: Numbers and Strings > Strings > Manipulating Characters in a String
You can use Integer#valueOf() to convert a String into an Integer.
Integer i = Integer.valueOf(part);
Tutorial: Numbers and Strings > Strings > Converting between Numbers and Strings
You can use List#add() to add an element to a List.
numbers.add(i);
Tutorial: Interfaces > The List Interface
So, in a nutshell (assuming that the file doesn't have empty lines nor trailing/leading whitespace).
List<Integer> numbers = new ArrayList<>();
for (String line : Files.readAllLines(Paths.get("/path/to/file.txt"))) {
for (String part : line.split("\\s+")) {
Integer i = Integer.valueOf(part);
numbers.add(i);
}
}
If you happen to be at Java 8 already, then you can even use Stream API for this, starting with Files#lines().
List<Integer> numbers = Files.lines(Paths.get("/path/to/test.txt"))
.map(line -> line.split("\\s+")).flatMap(Arrays::stream)
.map(Integer::valueOf)
.collect(Collectors.toList());
Tutorial: Processing data with Java 8 streams
Java 1.5 introduced the Scanner class for handling input from files and streams.
It can be used to read integers from a file and would look something like this:
List<Integer> integers = new ArrayList<Integer>();
Scanner fileScanner = new Scanner(new File("c:\\file.txt"));
while (fileScanner.hasNextInt()){
integers.add(fileScanner.nextInt());
}
Check the API though. There are many more options for dealing with different types of input sources, differing delimiters, and differing data types.
This example code shows you how to read file in Java.
import java.io.*;
/**
* This example code shows you how to read file in Java
*
* IN MY CASE RAILWAY IS MY TEXT FILE WHICH I WANT TO DISPLAY YOU CHANGE WITH YOUR OWN
*/
public class ReadFileExample
{
public static void main(String[] args)
{
System.out.println("Reading File from Java code");
//Name of the file
String fileName="RAILWAY.txt";
try{
//Create object of FileReader
FileReader inputFile = new FileReader(fileName);
//Instantiate the BufferedReader Class
BufferedReader bufferReader = new BufferedReader(inputFile);
//Variable to hold the one line data
String line;
// Read file line by line and print on the console
while ((line = bufferReader.readLine()) != null) {
System.out.println(line);
}
//Close the buffer reader
bufferReader.close();
}catch(Exception e){
System.out.println("Error while reading file line by line:" + e.getMessage());
}
}
}
Look at this example, and try to do your own:
import java.io.*;
public class ReadFile {
public static void main(String[] args){
String string = "";
String file = "textFile.txt";
// Reading
try{
InputStream ips = new FileInputStream(file);
InputStreamReader ipsr = new InputStreamReader(ips);
BufferedReader br = new BufferedReader(ipsr);
String line;
while ((line = br.readLine()) != null){
System.out.println(line);
string += line + "\n";
}
br.close();
}
catch (Exception e){
System.out.println(e.toString());
}
// Writing
try {
FileWriter fw = new FileWriter (file);
BufferedWriter bw = new BufferedWriter (fw);
PrintWriter fileOut = new PrintWriter (bw);
fileOut.println (string+"\n test of read and write !!");
fileOut.close();
System.out.println("the file " + file + " is created!");
}
catch (Exception e){
System.out.println(e.toString());
}
}
}
Just for fun, here's what I'd probably do in a real project, where I'm already using all my favourite libraries (in this case Guava, formerly known as Google Collections).
String text = Files.toString(new File("textfile.txt"), Charsets.UTF_8);
List<Integer> list = Lists.newArrayList();
for (String s : text.split("\\s")) {
list.add(Integer.valueOf(s));
}
Benefit: Not much own code to maintain (contrast with e.g. this). Edit: Although it is worth noting that in this case tschaible's Scanner solution doesn't have any more code!
Drawback: you obviously may not want to add new library dependencies just for this. (Then again, you'd be silly not to make use of Guava in your projects. ;-)
Use Apache Commons (IO and Lang) for simple/common things like this.
Imports:
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.ArrayUtils;
Code:
String contents = FileUtils.readFileToString(new File("path/to/your/file.txt"));
String[] array = ArrayUtils.toArray(contents.split(" "));
Done.
Using Java 7 to read files with NIO.2
Import these packages:
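import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.StringTokenizer;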
This is the process to read a file:
Path file = Paths.get("C:\\Java\\file.txt");
if(Files.exists(file) && Files.isReadable(file)) {
try {
// File reader
BufferedReader reader = Files.newBufferedReader(file, Charset.defaultCharset());
String line;
// read each line
while((line = reader.readLine()) != null) {
System.out.println(line);
// tokenize each number
StringTokenizer tokenizer = new StringTokenizer(line, " ");
while (tokenizer.hasMoreElements()) {
// parse each integer in file
int element = Integer.parseInt(tokenizer.nextToken());
}
}
reader.close();
} catch (Exception e) {
e.printStackTrace();
}
}
To read all lines of a file at once:
Path file = Paths.get("C:\\Java\\file.txt");
List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
All the answers so far given involve reading the file line by line, taking the line in as a String, and then processing the String.
There is no question that this is the easiest approach to understand, and if the file is fairly short (say, tens of thousands of lines), it'll also be acceptable in terms of efficiency. But if the file is long, it's a very inefficient way to do it, for two reasons:
Every character gets processed twice, once in constructing the String, and once in processing it.
The garbage collector will not be your friend if there are lots of lines in the file. You're constructing a new String for each line, and then throwing it away when you move to the next line. The garbage collector will eventually have to dispose of all these String objects that you don't want any more. Someone's got to clean up after you.
If you care about speed, you are much better off reading a block of data and then processing it byte by byte rather than line by line. Every time you come to the end of a number, you add it to the List you're building.
It will come out something like this:
private List<Integer> readIntegers(File file) throws IOException {
List<Integer> result = new ArrayList<>();
RandomAccessFile raf = new RandomAccessFile(file, "r");
byte buf[] = new byte[16 * 1024];
final FileChannel ch = raf.getChannel();
int fileLength = (int) ch.size();
final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0,
fileLength);
int acc = 0;
while (mb.hasRemaining()) {
int len = Math.min(mb.remaining(), buf.length);
mb.get(buf, 0, len);
for (int i = 0; i < len; i++)
if ((buf[i] >= 48) && (buf[i] <= 57))
acc = acc * 10 + buf[i] - 48;
else {
result.add(acc);
acc = 0;
}
}
ch.close();
raf.close();
return result;
}
The code above assumes that this is ASCII (though it could be easily tweaked for other encodings), and that anything that isn't a digit (in particular, a space or a newline) represents a boundary between digits. It also assumes that the file ends with a non-digit (in practice, that the last line ends with a newline), though, again, it could be tweaked to deal with the case where it doesn't.
It's much, much faster than any of the String-based approaches also given as answers to this question. There is a detailed investigation of a very similar issue in this question. You'll see there that there's the possibility of improving it still further if you want to go down the multi-threaded line.
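The two tweaks mentioned above (a file that does not end with a non-digit, and runs of several separators such as "\r\n") could be handled with a small flag, as in this sketch. Note that it swaps the memory-mapped buffer for a plain InputStream to keep the example short; it still assumes ASCII, non-negative numbers, and values that fit in an int. FastIntReader is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original posts.
final class FastIntReader {

    static List<Integer> readIntegers(InputStream in) throws IOException {
        List<Integer> result = new ArrayList<>();
        byte[] buf = new byte[16 * 1024];
        int acc = 0;
        boolean inNumber = false; // true while we are in the middle of a digit run
        int len;
        while ((len = in.read(buf)) != -1) {
            for (int i = 0; i < len; i++) {
                byte b = buf[i];
                if (b >= '0' && b <= '9') {
                    acc = acc * 10 + (b - '0');
                    inNumber = true;
                } else if (inNumber) {
                    // first non-digit after a number: emit it once
                    result.add(acc);
                    acc = 0;
                    inNumber = false;
                }
                // further separators are ignored, so "\r\n" adds nothing
            }
        }
        if (inNumber) {
            result.add(acc); // file ended right after a digit
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "1 62 4 55 5 6 77".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readIntegers(new ByteArrayInputStream(data)));
        // prints [1, 62, 4, 55, 5, 6, 77]
    }
}
```

For a file, the stream could be a FileInputStream; the accumulator and flag live outside the read loop, so a number that straddles two buffer reads is still parsed correctly.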
With Java 8, read the file into a list and then do whatever you want with it:
Files.lines(Paths.get("c://lines.txt")).collect(Collectors.toList());
