How to remove byte order mark using scala.io.Source?

A byte order mark is making my regex fail when using scala.io.Source to read from a file. This answer gives a lightweight solution using java.io. Is there anything similar for scala.io.Source, or will I have to fall back to Java because of a single byte?

Based on Joe K's idea in his comment, and using Andrei Punko's answer for the problem in Java and Alvin Alexander's Scala code, the simplest way to read a file that may start with a byte order mark into an array of strings is:
import java.io.{BufferedReader, FileInputStream, IOException, InputStreamReader, Reader}
import scala.collection.mutable.ArrayBuffer
@throws[IOException]
def skip(reader: Reader): Unit = {
  reader.mark(1)
  val possibleBOM = new Array[Char](1)
  reader.read(possibleBOM)
  // rewind unless the first character really was a BOM
  if (possibleBOM(0) != '\ufeff') reader.reset()
}
// decode as UTF-8 so a BOM arrives as the single character '\ufeff'
val br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))
skip(br)
val lines = {
  val ls = new ArrayBuffer[String]()
  var l: String = null
  while ({ l = br.readLine(); l != null }) {
    ls.append(l)
  }
  br.close()
  ls.toArray
}
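If you'd rather stay with scala.io.Source, a minimal sketch (assuming UTF-8; the helper name here is made up) is to read the lines as usual and strip a leading BOM from the first line if one is present:
import scala.io.Source
def readLinesSkippingBOM(path: String): Array[String] = {
  val src = Source.fromFile(path, "UTF-8")
  try {
    val lines = src.getLines().toArray
    // a UTF-8 BOM decodes to a single '\ufeff' at the start of the first line
    if (lines.nonEmpty && lines(0).startsWith("\ufeff"))
      lines(0) = lines(0).substring(1)
    lines
  } finally src.close()
}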

Related

How to add a new column when creating a CSV file in Scala

val csv_writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest + "whatever" + ".csv")))
for (x <- storeTimestamp) {
  csv_writer.write(x + "\n")
}
csv_writer.close()
Are there any other ways to write 2 lists of strings to a CSV file?
For now my code only writes 1 list of strings to the CSV file. How can I add a new column from another list?
Is it OK to use BufferedWriter?
If I understand your question, you have some CSV data and you want to add a new column of data to the end of each line before writing it to a file.
Here's how I might go about it.
// some pretend data
val origCSV = Seq("a,b,c", "d,e,f", "x,y,z")
val newCLMN = Seq("4X", "2W", "9A")
// put them together
val allData = origCSV.zip(newCLMN).map{case (a,b) => s"$a,$b\n"}
Note: zip will only zip the two collections together until it runs out of one or the other. If there's data left over in the larger collection then it is ignored. If that's not desirable then you might try zipAll, as the sketch below shows.
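For illustration (a hypothetical fourth row with no matching column value, padded here with a made-up "NA" default):
val rows = Seq("a,b,c", "d,e,f", "x,y,z", "q,r,s")
val cols = Seq("4X", "2W", "9A")
rows.zipAll(cols, "", "NA").map { case (r, c) => s"$r,$c" }
// -> List("a,b,c,4X", "d,e,f,2W", "x,y,z,9A", "q,r,s,NA")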
On to the file writing.
import java.io.{File, FileWriter}
import util.Try
val writer = Try(new FileWriter(new File("filename.csv")))
writer.map { w => w.write(allData.mkString); w }
  .recoverWith { case e => e.printStackTrace(); writer }
  .map(_.close())
And the result is...
>> cat filename.csv
a,b,c,4X
d,e,f,2W
x,y,z,9A
>>
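As a side note, on Scala 2.13+ the write-and-close could also be handled by scala.util.Using, which closes the writer even if the write throws; a minimal sketch under that assumption:
import java.io.{File, FileWriter}
import scala.util.Using
// returns a Try[Unit]; the writer is closed whether or not write throws
Using(new FileWriter(new File("filename.csv"))) { w =>
  w.write(allData.mkString)
}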

JVM char array occupies lots of memory

I have the same problem as asked here: JVM Monitor char array memory usage. But I did not get a clear answer from that question, and I couldn't add a comment because of low reputation. So I am asking here.
I wrote a multithreaded program that calculates word co-occurrence frequencies. I read words lazily from a file and do the calculations. In the program, I have a map which holds word pairs and their co-occurrence counts. After finishing the counting, I write this map to a file.
Here is my problem:
After writing the frequency map to a file, the file size is, for example, 3 GB. But while the program runs, it uses 35 GB of RAM plus 5 GB of swap. (JVM monitor screenshots of memory usage, garbage-collector activity, and the parameters overview are omitted here.)
How can a char[] array occupy this much memory when the output file size is 3 GB? Thanks.
Okay, here is the code that causes this problem:
This code is not multithreaded; it merges files that contain co-occurring word pairs and their counts. It causes the same memory-usage problem, and because of the high heap-space usage it also triggers lots of GC calls, so the normal program cannot run due to the stop-the-world garbage collector:
import java.io.{BufferedWriter, File, FileWriter, FilenameFilter}
import java.util.regex.Pattern
import core.WordTuple
import scala.collection.mutable.{Map => mMap}
import scala.io.{BufferedSource, Source}
class PairWordsMerger(path: String, regex: String) {
  private val wordsAndCounts: mMap[WordTuple, Int] = mMap[WordTuple, Int]()
  private val pattern: Pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
  private val dir: File = new File(path)
  private var sWordAndCount: Array[String] = Array.fill(3)("")
  private var tempTuple: WordTuple = WordTuple("", "")
  private val matchedFiles: Array[File] = dir.listFiles(new FilenameFilter {
    override def accept(dir: File, name: String): Boolean = pattern.matcher(name).matches()
  })
  def merge(): Unit = {
    for (fileName <- matchedFiles) {
      val file: BufferedSource = Source.fromFile(fileName)
      val iter: Iterator[String] = file.getLines()
      while (iter.hasNext) {
        // entries in the file are held in this format: word1,word2,frequency
        sWordAndCount = iter.next().split(",")
        tempTuple = WordTuple(sWordAndCount(0), sWordAndCount(1))
        try {
          wordsAndCounts += (tempTuple -> (wordsAndCounts.getOrElse(tempTuple, 0) + sWordAndCount(2).toInt))
        } catch {
          case e: NumberFormatException => println("Cannot parse to int...")
        }
      }
      file.close()
      println("One pair words map update done")
    }
    writeToFile()
  }
  private def writeToFile(): Unit = {
    val f: File = new File("allPairWords.txt")
    val out = new BufferedWriter(new FileWriter(f))
    for (elem <- wordsAndCounts) {
      out.write(elem._1 + "," + elem._2 + "\n")
    }
    out.close()
  }
}
object PairWordsMerger {
  def apply(path: String, regex: String): PairWordsMerger = new PairWordsMerger(path, regex)
}

Read a file faster & convert it into hex

I need to read a file that is in ASCII and convert it into hex before applying some functions (searching for a specific character).
To do this, I read the file, convert it to hex, and write it into a new file. Then I open my new hex file and apply my functions.
My issue is that it takes way too much time to read and convert it (approx. 8 seconds for a 9 MB file).
My reading method is :
public static void convertToHex2(PrintStream out, File file) throws IOException {
    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
    int value = 0;
    StringBuilder sbHex = new StringBuilder();
    StringBuilder sbResult = new StringBuilder();
    while ((value = bis.read()) != -1) {
        sbHex.append(String.format("%02X ", value));
    }
    sbResult.append(sbHex);
    out.print(sbResult);
    bis.close();
}
Do you have any suggestions to make it faster ?
Did you measure what your actual bottleneck is? You read a very small amount of data in each loop iteration and process it byte by byte. You might as well read larger chunks of data and process those, e.g. using DataInputStream or whatever. That way you would benefit more from the optimized reads of your OS, file system, their caches, etc.
Additionally, you fill sbHex and append that to sbResult, only to print it somewhere. That looks like an unnecessary copy to me, because sbResult will always be empty in your case, and with sbHex you already have a StringBuilder for your PrintStream.
Try this:
static String[] xx = new String[256];
static {
    for (int i = 0; i < 256; ++i) {
        xx[i] = String.format("%02X ", i);
    }
}
and use it:
sbHex.append(xx[value]);
Formatting is a heavy operation: it does not only do the conversion, it also has to parse the format string every time.
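Putting both suggestions together, here is a minimal sketch in Scala (the language of the surrounding thread) that reads the file in larger chunks and indexes into a precomputed table; the function name and buffer size are assumptions for illustration:
import java.io.{BufferedInputStream, FileInputStream}
// one precomputed "XX " string per possible byte value
val hex: Array[String] = Array.tabulate(256)(i => f"$i%02X ")
def fileToHex(path: String): String = {
  val in = new BufferedInputStream(new FileInputStream(path))
  val sb = new StringBuilder
  try {
    val buf = new Array[Byte](64 * 1024) // 64 KB chunks instead of byte-at-a-time reads
    var n = in.read(buf)
    while (n != -1) {
      var i = 0
      while (i < n) { sb.append(hex(buf(i) & 0xFF)); i += 1 }
      n = in.read(buf)
    }
  } finally in.close()
  sb.toString
}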

How to chain multiple different InputStreams into one InputStream

I'm wondering if there is any idiomatic way to chain multiple InputStreams into one continual InputStream in Java (or Scala).
What I need it for is to parse flat files that I load over the network from an FTP server. What I want to do is to take file[1..N], open up streams, and then combine them into one stream. So when file1 comes to an end, I want to start reading from file2, and so on, until I reach the end of fileN.
I need to read these files in a specific order; the data comes from a legacy system that produces files in batches, so data in one file depends on data in another file, but I would like to handle them as one continual stream to simplify my domain logic interface.
I searched around and found PipedInputStream, but I'm not positive that is what I need. An example would be helpful.
It's right there in JDK! Quoting JavaDoc of SequenceInputStream:
A SequenceInputStream represents the logical concatenation of other input streams. It starts out with an ordered collection of input streams and reads from the first one until end of file is reached, whereupon it reads from the second one, and so on, until end of file is reached on the last of the contained input streams.
You want to concatenate arbitrary number of InputStreams while SequenceInputStream accepts only two. But since SequenceInputStream is also an InputStream you can apply it recursively (nest them):
new SequenceInputStream(
    new SequenceInputStream(
        new SequenceInputStream(file1, file2),
        file3
    ),
    file4
);
...you get the idea.
See also
How do you merge two input streams in Java? (dup?)
This is done using SequenceInputStream, which is straightforward in Java, as Tomasz Nurkiewicz's answer shows. I had to do this repeatedly in a project recently, so I added some Scala-y goodness via the "pimp my library" pattern.
import java.io.{InputStream, SequenceInputStream}
object StreamUtils {
  implicit def toRichInputStream(str: InputStream) = new RichInputStream(str)
  class RichInputStream(str: InputStream) {
    // a bunch of other handy Stream functionality, deleted
    def ++(str2: InputStream): InputStream = new SequenceInputStream(str, str2)
  }
}
With that, I can do stream sequencing as follows
val mergedStream = stream1++stream2++stream3
or even
val streamList = //some arbitrary-length list of streams, non-empty
val mergedStream = streamList.reduceLeft(_++_)
Another solution: first create a list of input streams and then create the sequence of input streams:
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

List<InputStream> iss = Files.list(Paths.get("/your/path"))
        .filter(Files::isRegularFile)
        .<InputStream>map(f -> {
            try {
                return new FileInputStream(f.toString());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).collect(Collectors.toList());
InputStream result = new SequenceInputStream(Collections.enumeration(iss));
Here is a more elegant solution using Vector; this one is Android-specific (it reads from an AssetManager), but you can use Vector in any Java:
AssetManager am = getAssets();
Vector<InputStream> v = new Vector<>(Constant.PAGES);
for (int i = 0; i < Constant.PAGES; i++) {
    String fileName = "file" + i + ".txt";
    InputStream is = am.open(fileName);
    v.add(is);
}
Enumeration<InputStream> e = v.elements();
SequenceInputStream sis = new SequenceInputStream(e);
InputStreamReader isr = new InputStreamReader(sis);
Scanner scanner = new Scanner(isr); // or use a BufferedReader
Here's a simple Scala version that concatenates an Iterator[InputStream]:
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._
def concatInputStreams(streams: Iterator[InputStream]): InputStream =
new SequenceInputStream(streams.asJavaEnumeration)
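A hypothetical usage (the file names are made up), feeding the helper an iterator of FileInputStreams:
import java.io.FileInputStream
val merged = concatInputStreams(
  Iterator("part1.dat", "part2.dat").map(new FileInputStream(_))
)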

Convert Stream to String Java/Groovy

I stole this snippet off the web. But it looks to be limited to 4096 bytes and is quite ugly IMO. Anyone know of a better approach? I'm actually using Groovy btw...
String streamToString(InputStream input) {
    StringBuffer out = new StringBuffer();
    byte[] b = new byte[4096];
    for (int n; (n = input.read(b)) != -1;) {
        out.append(new String(b, 0, n));
    }
    return out.toString();
}
EDIT:
I found a better solution in Groovy:
InputStream exportTemplateStream = getClass().getClassLoader().getResourceAsStream("export.template")
assert exportTemplateStream: "[export.template stream] resource not found"
String exportTemplate = exportTemplateStream.text
Some good and fast answers. However, I think the best one is that Groovy has added a getText method to InputStream, so all I had to do was stream.text. And good call on the 4096 comment.
For Groovy
filePath = ... //< a FilePath object
stream = filePath.read() //< InputStream object
// Specify the encoding, and get the String object
//content = stream.getText("UTF-16")
content = stream.getText("UTF-8")
The InputStream class reference
getText() without an encoding argument uses the current system encoding, e.g. "UTF-8".
Try IOUtils from Apache Commons:
String s = IOUtils.toString(inputStream, "UTF-8");
It reads the input in chunks of 4096 bytes (4 KB), but the size of the actual string is not limited, as it keeps reading more and appending it to the StringBuffer.
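If you are on Commons IO 2.3 or later, there is also an overload taking a java.nio.charset.Charset rather than a charset name; for example (shown in Scala to match the surrounding thread, with inputStream assumed from context):
import java.nio.charset.StandardCharsets
import org.apache.commons.io.IOUtils
val s = IOUtils.toString(inputStream, StandardCharsets.UTF_8)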
You can do it fairly easily using the Scanner class:
String streamToString(InputStream input) {
    Scanner s = new Scanner(input);
    StringBuilder builder = new StringBuilder();
    while (s.hasNextLine()) {
        builder.append(s.nextLine() + "\n");
    }
    return builder.toString();
}
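A widely used variant of the Scanner approach reads the whole stream as one token by setting the delimiter to \A (start of input); a sketch in Scala, with the encoding made explicit and the function name made up:
import java.io.InputStream
import java.util.Scanner
def streamAsString(input: InputStream): String = {
  // "\\A" matches the start of input, so next() returns the entire stream
  val s = new Scanner(input, "UTF-8").useDelimiter("\\A")
  if (s.hasNext) s.next() else ""
}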
The snippet in the question has a bug: if the input uses a multi-byte character encoding, there's a good chance that a single character will span two reads (and not be convertible). It also has the semi-bug that it relies on the platform's default encoding.
Instead, use Jakarta Commons IO. In particular, the version of IOUtils.toString() that takes an InputStream and applies an encoding to it.
For future readers who have similar problems, please note that both IOUtils from Apache and Groovy's InputStream.getText() method require the stream to complete or be closed before returning. If you are working with a persistent stream, you will need to deal with the "ugly" example that Phil originally posted, or work with non-blocking IO.
You can try something similar to this
new FileInputStream(new File("c:/tmp/file.txt")).eachLine { println it }
