How to split a byte array that contains multiple "lines" in Java?

Say we have a file like so:
one
two
three
(but this file got encrypted)
My crypto method returns the whole file in memory, as a byte[] type.
I know byte arrays don't have a concept of "lines", that's something a Scanner (for example) could have.
I would like to traverse each line, convert it to string and perform my operation on it but I don't know
how to:
Find lines in a byte array
Slice the original byte array to "lines" (I would convert those slices to String, to send to my other methods)
Correctly traverse a byte array, where each iteration is a new "line"
Also: do I need to consider the different OS the file might have been composed in? I know that there is some difference between new lines in Windows and Linux and I don't want my method to work only with one format.
Edit: Following some tips from answers here, I was able to write some code that gets the job done. I still wonder if this code is worthy of keeping or I am doing something that can fail in the future:
byte[] decryptedBytes = doMyCrypto(fileName, accessKey);
ByteArrayInputStream byteArrInStrm = new ByteArrayInputStream(decryptedBytes);
InputStreamReader inStrmReader = new InputStreamReader(byteArrInStrm);
BufferedReader buffReader = new BufferedReader(inStrmReader);
String delimRegex = ",";
String line;
String[] values = null;
while ((line = buffReader.readLine()) != null) {
values = line.split(delimRegex);
if (Objects.equals(values[0], tableKey)) {
return values;
}
}
System.out.println(String.format("No entry with key %s in %s", tableKey, fileName));
return values;
In particular, I was advised to explicitly set the encoding but I was unable to see exactly where?

If you want to stream this, I'd suggest:
Create a ByteArrayInputStream to wrap your array
Wrap that in an InputStreamReader to convert binary data to text - I suggest you explicitly specify the text encoding being used
Create a BufferedReader around that to read a line at a time
Then you can just use:
String line;
while ((line = bufferedReader.readLine()) != null)
{
// Do something with the line
}
BufferedReader handles line breaks from all operating systems.
So something like this:
byte[] data = ...;
ByteArrayInputStream stream = new ByteArrayInputStream(data);
InputStreamReader streamReader = new InputStreamReader(stream, StandardCharsets.UTF_8);
BufferedReader bufferedReader = new BufferedReader(streamReader);
String line;
while ((line = bufferedReader.readLine()) != null)
{
System.out.println(line);
}
Note that in general you'd want to use try-with-resources blocks for the streams and readers - but it doesn't matter in this case, because it's just in memory.
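To illustrate the point about line breaks, here is a minimal sketch that pushes a byte array with mixed \n, \r\n and \r terminators through the same reader chain. The class and method names (ByteLinesDemo, splitLines) are made up for illustration, and the sketch assumes the decrypted data is UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteLinesDemo {

    // Hypothetical helper: decodes the bytes as UTF-8 (an assumption about the data)
    // and returns one String per line, without the line terminators.
    static List<String> splitLines(byte[] data) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream, but readLine() declares it.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Unix (\n), Windows (\r\n) and old Mac (\r) terminators in one array.
        byte[] data = "one\ntwo\r\nthree\rfour".getBytes(StandardCharsets.UTF_8);
        for (String line : splitLines(data)) {
            System.out.println(line);
        }
    }
}
```

Each returned String can then be passed on to whatever method processes a single line.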

As Scott states, I would like to see what you came up with so we can help you alter it to fit your needs.
Regarding your last comment about the OS: if you want to support multiple file types, you should consider writing separate functions for the different file extensions. As far as I know, you do need to specify which file and what type of file you are reading in your code.

Related

ByteArrayOutputStream to String Array

I'm writing an application which has a method that will download a text file from my server. This text file will contain ~1,000 proxy IP's. The download will happen every 10 minutes. I need to find the most efficient way of doing this.
Currently I have a method in a class called Connection which will return the bytes of whatever I want to retrieve. So if I make a connection to the server for the text file using such method, I will get it returned in bytes. My other method will create a very long string from these bytes. After, I split the long string into an array using System.LineSeparator. Here is the code:
public static void fetchProxies(String url) {
Connection c = new Connection();
List<Proxy> tempProxy = new ArrayList<Proxy>();
ByteArrayOutputStream baos =
c.requestBytes(url);
String line = new String(baos.toByteArray());
String[] split = line.split(System.lineSeparator());
//more code to come but the above works fine.
}
This currently works, but I know that it isn't the most efficient way.
My Problem
Instead of turning the bytes into a very long string, what is the most efficient way of turning the bytes into my IP's so I can add each individual IP into an arraylist and then return the arraylist full of IP's?

The most efficient and logical way would be to create a BufferedReader wrapping an InputStreamReader wrapping the InputStream of the URL connection. You would then call readLine() on the BufferedReader until it returns null, and append each line read to the list of IP addresses:
List<String> ipList = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), theAppropriateEncoding))) {
String line;
while ((line = reader.readLine()) != null) {
ipList.add(line);
}
}
Note that this probably won't change the performance of the method much, though, because most of the time is spent waiting for bytes to arrive from the remote host, which is considerably slower than building and splitting a String in memory.

The split method of String isn't the fastest way to separate all the IPs. There are other libraries that do this in a more optimized way.
Read this: http://demeranville.com/battle-of-the-tokenizers-delimited-text-parser-performance/
It has a very nice timing comparison of 7 different ways to split a String.
For example, the Splitter class from the Guava library returns an Iterable, and with Guava you can also convert the result to a List:
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;
...
public static void fetchProxies(String url) {
    Connection c = new Connection();
    List<Proxy> tempProxy = new ArrayList<Proxy>();
    ByteArrayOutputStream baos = c.requestBytes(url);
    String line = new String(baos.toByteArray());
    // Splitter.split returns an Iterable<String>, not an Iterator
    Iterable<String> parts =
        Splitter.on(System.getProperty("line.separator")).split(line);
    List<String> myList = Lists.newArrayList(parts);
    // do something with the List...
}

How to split a very long string

I have big file (about 30mb) and here the code I use to read data from the file
BufferedReader br = new BufferedReader(new FileReader(file));
try {
String line = br.readLine();
while (line != null) {
sb.append(line).append("\n");
line = br.readLine();
}
Then I need to split the content I read, so I use
String[] inst = sb.toString().split("GO");
The problem is that sometimes the sub-string is over the maximum String length and I can't get all the data inside the string. How can I get rid of this?
Thanks

Scanner s = new Scanner(input).useDelimiter("GO"); and use s.next()
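A minimal sketch of that idea, so that each chunk is handled as soon as it is read instead of building one huge String first. The class and method names (GoSplitDemo, splitOnGo) are made up for illustration, and the sketch assumes the literal text GO only ever appears as a separator; useDelimiter treats its argument as a regular expression, so a GO inside other words would also split.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class GoSplitDemo {

    // Hypothetical helper: reads from any Readable (e.g. a FileReader) and
    // returns the pieces found between occurrences of "GO".
    static List<String> splitOnGo(Readable input) {
        List<String> chunks = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter("GO")) {
            while (scanner.hasNext()) {
                // In real code, process the chunk here instead of collecting it.
                chunks.add(scanner.next());
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the 30 MB file in this sketch.
        List<String> chunks = splitOnGo(new StringReader("INSERT 1;GOINSERT 2;GOINSERT 3;"));
        for (String chunk : chunks) {
            System.out.println(chunk);
        }
    }
}
```

For the real file, the StringReader would be replaced with a reader over the file; collecting all chunks in a list is only done here to keep the example short.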

Why it happens: the erroneous result may be caused by a non-contiguous (fragmented) heap, as the CMS collector doesn't defragment memory.
(This does not answer the "how to solve it" part, though.)
You could opt for loading the whole string in parts, i.e. using substring.

Special characters from txt file

I am downloading a text file from FTP with a common FTP library.
The problem is that when I read the file into an array line by line, characters such as æøå are not read correctly. Instead, the "?" character is shown.
Here is my code:
FileInputStream fstream = openFileInput("name of text file");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
String strLine;
ArrayList<String> lines = new ArrayList<String>();
while ((strLine = br.readLine()) != null) {
lines.add(strLine);
}
String[] linjer = lines.toArray(new String[0]);
ArrayList<String> imei = new ArrayList<String>();
for(int o=0;o<linjer.length;o++)
{
String[] holder = linjer[o].split(" - ");
imei.add(holder[0] + " - " + holder[2]);
}
String[] imeinr = imei.toArray(new String[0]);
I have tried specifying UTF-8 in my InputStreamReader, and I have tried a UnicodeReader class, but with no success.
I am fairly new to Java, so this might just be a stupid question, but I hope you can help. :)

There is no reason to use a DataInputStream. The DataInputStream and DataOutputStream classes are meant for reading and writing primitive Java data types in a binary format. You are just reading the contents of a text file line by line, so using a DataInputStream is unnecessary and may produce incorrect results.
FileInputStream fstream = openFileInput("name of text file");
//DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
Tip: the for-each loop (the enhanced for loop, available since Java 5) lets you iterate through the contents of an array without defining a loop counter. This simplifies your code, making it easier to read and maintain over time.
for(String line : linjer){
String[] holder = line.split(" - ");
imei.add(holder[0] + " - " + holder[2]);
}
Note: Foreach loops can also be used with List objects.

I would suggest that the file may not be in UTF-8. It could be in CP1252 or something, especially if you're using Windows.
Try downloading the file and running your code on the local copy to see if that works.
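A small sketch of what such a mismatch looks like. The class and method names are made up for illustration, and ISO-8859-1 is used as a stand-in because it is guaranteed to exist on every JVM; for the three letters æøå it uses the same byte values as CP1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Hypothetical helper: decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes for the letters ae, o-slash and a-ring, so the demo
        // does not depend on the encoding of this source file.
        String original = "\u00E6\u00F8\u00E5";
        byte[] singleByteEncoded = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the wrong charset: the bytes are not valid UTF-8,
        // so they are replaced with U+FFFD (often displayed as "?").
        String wrong = decode(singleByteEncoded, StandardCharsets.UTF_8);
        // Decoding with the charset the bytes were written in restores the text.
        String right = decode(singleByteEncoded, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decode contains replacement char: " + wrong.contains("\uFFFD"));
        System.out.println("ISO-8859-1 decode matches original: " + right.equals(original));
    }
}
```

If the second decode gives the expected text for the real file, passing that charset to the InputStreamReader instead of "UTF-8" is the likely fix.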

FTP has two transfer modes: binary and ASCII. Make sure you are using the correct one. Look here for details: http://www.rhinosoft.com/newsletter/NewsL2008-03-18.asp

Check line for unprintable characters while reading text file

My program must read text files line by line.
The files are in UTF-8.
I am not sure the files are correct: they may contain unprintable characters.
Is it possible to check for this without going down to the byte level?
Thanks.

Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (available since Java 7):
String line;
try (
InputStream fis = new FileInputStream("the_file_name");
InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(isr);
) {
while ((line = br.readLine()) != null) {
// Deal with the line
}
}

While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.

Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines=Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for(String line:lines){
System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...

If you want to check a string has unprintable characters you can use a regular expression
[^\p{Print}]
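A minimal sketch of applying that expression to each line; the class and method names are made up for illustration. One caveat: by default Java's \p{Print} is a POSIX class that covers US-ASCII only, so letters such as æ (and tabs) are reported as unprintable. Compiling the pattern with Pattern.UNICODE_CHARACTER_CLASS is one way to accept non-ASCII letters; whether that matches your idea of "printable" is up to you.

```java
import java.util.regex.Pattern;

public class PrintableCheckDemo {

    // ASCII-only notion of "printable" (the default meaning of \p{Print} in Java).
    private static final Pattern NON_PRINTABLE_ASCII =
            Pattern.compile("[^\\p{Print}]");

    // Unicode-aware variant of the same expression.
    private static final Pattern NON_PRINTABLE_UNICODE =
            Pattern.compile("[^\\p{Print}]", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean hasNonPrintableAscii(String line) {
        return NON_PRINTABLE_ASCII.matcher(line).find();
    }

    static boolean hasNonPrintableUnicode(String line) {
        return NON_PRINTABLE_UNICODE.matcher(line).find();
    }

    public static void main(String[] args) {
        String clean = "hello world";
        String withBell = "bell\u0007";   // U+0007 is a control character
        String danish = "\u00E6";         // the letter ae, written as an escape

        System.out.println(hasNonPrintableAscii(clean));
        System.out.println(hasNonPrintableAscii(withBell));
        System.out.println(hasNonPrintableAscii(danish));
        System.out.println(hasNonPrintableUnicode(danish));
    }
}
```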

How about below:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// if no more lines the readLine() returns null
while ((line = br.readLine()) != null) {
// reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html

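I can find the following ways to do it.
private static final String fileName = "C:/Input.txt";
public static void main(String[] args) throws IOException {
    // 1. Files.lines: a lazy stream, which should be closed after use
    try (Stream<String> lines = Files.lines(Paths.get(fileName))) {
        String[] array = lines.toArray(String[]::new);
    }
    // 2. Files.readAllLines: reads the whole file into a List
    List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
    readAllLines.forEach(s -> System.out.println(s));
    // 3. Scanner: hasNextLine()/nextLine() read whole lines,
    //    whereas hasNext()/next() would return whitespace-separated tokens
    File file = new File(fileName);
    try (Scanner scanner = new Scanner(file)) {
        while (scanner.hasNextLine()) {
            System.out.println(scanner.nextLine());
        }
    }
}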

The answer by @T.J.Crowder is for Java 6. In Java 7 the valid answer is the one by @McIntosh, though using Charset.forName for UTF-8 is discouraged in favor of StandardCharsets.UTF_8:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
StandardCharsets.UTF_8);
for(String line: lines){ /* DO */ }
This closely resembles the Guava approach posted by Skeet above, and of course the same caveats apply. For big files, read line by line instead (Java 7):
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}
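The two lines above leave the reader open. A minimal sketch of the same loop wrapped in try-with-resources follows; the class and method names are made up for illustration, and counting the lines stands in for whatever per-line processing is needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BigFileLinesDemo {

    // Hypothetical helper: streams the file line by line, so only one line
    // is held in memory at a time, and closes the reader when done.
    static long countLines(Path path) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                count++; // replace with the real per-line processing
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A small temporary file stands in for the big file in this sketch.
        Path tmp = Files.createTempFile("lines-demo", ".txt");
        try {
            Files.write(tmp, Arrays.asList("one", "two", "three"), StandardCharsets.UTF_8);
            System.out.println(countLines(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```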

If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.

Unable to read the content from a non-empty InputStream

I have a piece of code that reads the content from a non-empty InputStream. It works fine in Eclipse and when run through an Ant script on my computer, but it fails on another computer: the result is an empty String. I have checked that the InputStream is not null. The InputStream is reading a local file, and the file is not empty.
Here are the two different ways I have tried, both of them return an empty String:
Way 1:
StringBuilder aStringBuilder = new StringBuilder();
String strLine = null;
BufferedReader aBufferedReaders = new BufferedReader(new InputStreamReader(anInputStream, "UTF-8"));
while ((strLine = aBufferedReaders.readLine()) != null)
{
aStringBuilder.append(strLine);
}
return aStringBuilder.toString();
Way 2:
StringBuffer buffer = new StringBuffer();
byte[] b = new byte[4096];
for (int n; (n = theInputStream.read(b)) != -1;)
{
buffer.append(new String(b, 0, n));
}
String str = buffer.toString();
return str;
Thanks in advance!

The input stream can be non-null but still empty - and if no exceptions are being thrown but an empty string is being returned, then the input stream is empty. You should look at the code which is opening the input stream in the first place - the code to read from the stream isn't the source of the error, although you need to decide which encoding you're trying to read, and use that appropriately. (The first code looks better to me, explicitly using UTF-8 and using an InputStreamReader for text conversion.)
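A minimal sketch of the "non-null but empty" case: a stream over zero bytes passes a null check, yet the first readLine() returns null and the result is an empty String, with no exception thrown. The class and method names are made up for illustration, and the reading loop mirrors "Way 1" from the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class EmptyStreamDemo {

    // Hypothetical helper: concatenates all lines of the stream, decoded as UTF-8.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A valid, non-null stream that simply has no content.
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        System.out.println("stream is null: " + (empty == null));
        System.out.println("result is empty: " + readAll(empty).isEmpty());
    }
}
```

So the thing to check on the failing machine is what the code that opens the stream is actually pointing at, for example whether the path or resource resolves to a different (empty) file there.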
