I have two very large CSV files that will only continue to get larger with time. The documents I'm using to test are 170 columns wide and roughly 57,000 rows. This is using data from 2018 to now; ideally the end result will be able to run on CSVs with data going as far back as 2008, which will make the files massive.
Currently I'm using Univocity, but the creator has been inactive on answering questions for quite some time and their website has been down for weeks, so I'm open to changing parsers if need be.
Right now I have the following code:
public void test() {
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    parserSettings.setHeaderExtractionEnabled(false);

    CsvParser sourceParser = new CsvParser(parserSettings);
    sourceParser.beginParsing(sourceFile);

    Writer writer = new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.UTF_8);
    CsvWriterSettings writerSettings = new CsvWriterSettings();
    CsvWriter csvWriter = new CsvWriter(writer, writerSettings);
    csvWriter.writeRow(headers);

    String[] sourceRow;
    String[] compareRow;

    while ((sourceRow = sourceParser.parseNext()) != null) {
        // restart the compare parser for every source row because row order is not guaranteed
        CsvParser compareParser = new CsvParser(parserSettings);
        compareParser.beginParsing(Path.of("src/test/resources/" + compareCsv + ".csv").toFile());
        while ((compareRow = compareParser.parseNext()) != null) {
            if (Arrays.equals(sourceRow, compareRow)) {
                break;
            } else {
                if (compareRow[KEY_A].trim().equals(sourceRow[KEY_A].trim()) &&
                        compareRow[KEY_B].trim().equals(sourceRow[KEY_B].trim()) &&
                        compareRow[KEY_C].trim().equals(sourceRow[KEY_C].trim())) {
                    for (String[] result : getOnlyDifferentValues(sourceRow, compareRow)) {
                        csvWriter.writeRow(result);
                    }
                    break;
                }
            }
        }
        compareParser.stopParsing();
    }
}
This all works exactly as I need it to, but as you can tell it takes forever. I'm stopping and restarting the parsing of the compare file because row order is not guaranteed in these files, so what is in row 1 of the source CSV could be in row 52,000 of the compare CSV.
The Question:
How do I get this faster? Here are my requirements:
Print row under following conditions:
KEY_A, KEY_B, KEY_C are equal but any other column is not equal
Source row is not found in compare CSV
Compare row is not found in source CSV
Presently I only have the first requirement working, but I need to tackle the speed issue first and foremost. Also, if I try to parse the file into memory I immediately run out of heap space and the application laughs at me.
Thanks in advance.
Also, if I try to parse the file into memory I immediately run out of heap space
Have you tried increasing the heap size? You don't say how large your data file is, but 57,000 rows * 170 columns * 100 bytes per cell = roughly 1 GB, which should pose no difficulty on modern hardware. Then you can keep the comparison file in a HashMap for efficient lookup by key.
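For example, something along these lines (a rough sketch, not drop-in code: it reuses parserSettings, sourceFile, csvWriter and getOnlyDifferentValues from the question, while compareFile and the buildKey helper are placeholders, and java.util.Map/HashMap and java.util.Arrays need to be imported):

// somewhere in the same class:
private static String buildKey(String[] row) {
    return row[KEY_A].trim() + "|" + row[KEY_B].trim() + "|" + row[KEY_C].trim();
}

// one pass over the compare file to index it by key
Map<String, String[]> compareByKey = new HashMap<>();
CsvParser compareParser = new CsvParser(parserSettings);
compareParser.beginParsing(compareFile);
String[] row;
while ((row = compareParser.parseNext()) != null) {
    compareByKey.put(buildKey(row), row);
}

// one pass over the source file, looking rows up instead of re-parsing the compare file
CsvParser sourceParser = new CsvParser(parserSettings);
sourceParser.beginParsing(sourceFile);
String[] sourceRow;
while ((sourceRow = sourceParser.parseNext()) != null) {
    String[] compareRow = compareByKey.remove(buildKey(sourceRow)); // remove, so leftovers = only-in-compare
    if (compareRow == null) {
        // requirement 2: source row not found in the compare CSV
    } else if (!Arrays.equals(sourceRow, compareRow)) {
        // requirement 1: keys equal but some other column differs
        for (String[] result : getOnlyDifferentValues(sourceRow, compareRow)) {
            csvWriter.writeRow(result);
        }
    }
}
// requirement 3: anything still left in compareByKey was never seen in the source CSV

This way each file is parsed exactly once, and every lookup is O(1) instead of a full rescan of the compare file.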
Alternatively, you could import the CSVs into a database and make use of its join algorithms.
Or if you'd rather reinvent the wheel while scrupulously avoiding memory use, you could first sort the CSVs (by partitioning them into sets small enough to sort in memory, and then doing a k-way merge to merge the sublists), and then do a merge join. But the other solutions are likely to be a lot easier to implement :-)
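For what it's worth, a minimal sketch of just the k-way merge step, assuming the chunk files have already been sorted and written out, and that sorting whole lines happens to match the key order you need; the class and method names here are made up for illustration:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMerge {

    // One open chunk file plus the line it is currently offering.
    private static final class Cursor {
        final BufferedReader reader;
        String current;
        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.current = reader.readLine();
        }
        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }
    }

    // Merge already-sorted chunk files into one sorted output file.
    static void merge(List<Path> sortedChunks, Path merged) throws IOException {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
        for (Path chunk : sortedChunks) {
            Cursor cursor = new Cursor(Files.newBufferedReader(chunk));
            if (cursor.current != null) heap.add(cursor);
        }
        try (BufferedWriter out = Files.newBufferedWriter(merged)) {
            while (!heap.isEmpty()) {
                Cursor smallest = heap.poll();      // chunk offering the smallest line
                out.write(smallest.current);
                out.newLine();
                if (smallest.advance()) heap.add(smallest);
                else smallest.reader.close();
            }
        }
    }
}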
I am new to machine learning and will be working on a project that involves using a machine learning library to detect and alert on possible anomalies. I will be using Apache Spark, and I decided to use the KMeans method for the project.
The main project consists of analyzing daily files, detecting fluctuating changes in some of the records, and reporting them as possible anomalies (if the model considers them one). The files are generated at the end of a day, and my program needs to check them the morning of the next day to see if there is an anomaly. However, I need to check for anomalies file vs. file, NOT within a file. This means that I have to compare the data of every file and see whether it fits the model I would create with the chosen algorithm. In other words, I have some valid data that I will apply the algorithm to in order to train my model. Then I have to apply this same model to other files of the same format but, obviously, different data. I'm not looking for a prediction column but rather for detecting anomalies in these other files. If there is an anomaly, the program should tell me which row/column has the anomaly, and then I have to program it to send an email saying that there is a possible anomaly in the specific file.
Like I said, I am new to machine learning. I want to know how I can use the KMeans algorithm to detect outliers/anomalies in a file.
So far I have created the model:
SparkConf conf = new SparkConf().setAppName("practice").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession
        .builder()
        .appName("Anomaly Detection")
        .getOrCreate();

String day1txt = "C:\\Users\\User\\Documents\\day1.txt";
String day2txt = "C:\\Users\\User\\Documents\\day2.txt";

Dataset<Row> day1 = spark.read()
        .option("header", "true")
        .option("delimiter", "\t")
        .option("inferSchema", "true")
        .csv(day1txt);

day1 = day1.withColumn("Size", day1.col("Size").cast("Integer"));
day1 = day1.withColumn("Records", day1.col("Records").cast("Integer"));

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"Size", "Records"})
        .setOutputCol("features");

Dataset<Row> day1vector = assembler.transform(day1);

KMeans kmeans = new KMeans().setK(5).setSeed(1L);
KMeansModel model = kmeans.fit(day1vector);
I don't know what to do from this point on to detect outliers. I have several other .txt files that should have "normalized" data, and also I have a couple of files that have "tampered/not-normalized" data. Do I need to train my model with all the test data I have available, and if so, how can I train a model using different datasets? Or can I only train it with one dataset and test it with the others?
EDIT:
This is a sample of the file (day1.txt) I will be using (dummy data of course / top 10)
Name Size Records
File1 1000 104370
File2 990 101200
File3 1500 109123
File4 2170 113888
File5 2000 111974
File6 1820 110666
File7 1200 106771
File8 1500 108991
File9 1000 104007
File10 1300 107037
This is considered normal data, and I will have different files with the same format but different values around the same range. Then I have some files where I purposely added an outlier, like Size: 1000, Records: 50000.
How can I detect that with KMeans? Or if KMeans is not the perfect model, which model should I use and how should I go around it?
There is a simple approach for this: create your clusters with KMeans, then for each cluster set a reasonable radius with respect to the center of that cluster; if some point lies outside that radius, it is an outlier.
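A rough sketch of that idea on top of the question's code, using Spark's ml API. Here day2vector, the radius threshold, and the collect-and-loop are assumptions for illustration; in practice you would pick the radius from the distribution of training distances (e.g. a high percentile) and avoid collecting large datasets. It needs org.apache.spark.ml.linalg.Vector and Vectors in addition to the question's imports:

double radius = 1000.0;                                   // tune this from the training data
Dataset<Row> day2vector = assembler.transform(day2);      // day2 read the same way as day1
Dataset<Row> predicted = model.transform(day2vector);     // adds a "prediction" (cluster id) column

Vector[] centers = model.clusterCenters();
for (Row r : predicted.collectAsList()) {                 // fine for small daily files
    Vector features = r.getAs("features");
    int cluster = r.getInt(r.fieldIndex("prediction"));
    double dist = Math.sqrt(Vectors.sqdist(features, centers[cluster]));
    if (dist > radius) {
        System.out.println("Possible anomaly: " + r);     // or collect these rows for the alert e-mail
    }
}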
Try looking at this: https://arxiv.org/pdf/1402.6859.pdf
There are some outlier detection techniques like OneClassSVM or Angle-Based Outlier Detection and so on. Try looking at this: http://scikit-learn.org/stable/modules/outlier_detection.html
I have a file that is 21.6GB and I want to read it from the end to the start rather than from the beginning to the end as you would usually do.
If I read each line of the file from the start to the end using the following code, then it takes 1 minute, 12 seconds.
val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLine {
val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
Now, I have read that to read in a file in reverse then I should use ReversedLinesFileReader from Apache Commons. I have created the following extension function to do just this:
fun File.forEachLineFromTheEndOfFile(action: (line: String) -> Unit) {
    val reader = ReversedLinesFileReader(this, Charset.defaultCharset())
    var line = reader.readLine()
    while (line != null) {
        action.invoke(line)
        line = reader.readLine()
    }
    reader.close()
}
and then call it in the following way, which is the same as the previous way only with a call to forEachLineFromTheEndOfFile function:
val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLineFromTheEndOfFile {
val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
This took 17 minutes and 50 seconds to run!
Am I using ReversedLinesFileReader in the correct way?
I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
Is it just the case that files should not be read from the end to the start?
You are asking for a very expensive operation. Not only are you using random access in blocks to read the file and going backwards (so if the file system is reading ahead, it is reading in the wrong direction), you are also reading an XML file which is UTF-8, and that encoding is slower to process than a fixed-width byte encoding.
Then on top of that you are using a less than efficient algorithm. It reads one block at a time, of an inconvenient size (is it disk-block-size aware? are you setting the block size to match your file system?), backwards while processing the encoding, makes an (unnecessary?) copy of the partial byte array, and then turns it into a string (do you need a string to parse?). It could create the string without the copy, and creating the string could really be deferred so you work directly from the buffer, only decoding if you need to (XML parsers, for example, also work from ByteArrays or buffers). And there are other array copies that just are not needed, but they are more convenient for the code.
It also might have a bug in that it checks for newlines without considering that the character might mean something different if it is actually part of a multi-byte sequence. It would have to look back a few extra characters to check this for variable-length encodings; I don't see it doing that.
So instead of a nice, forward-only, heavily buffered sequential read of the file, which is the fastest thing you can do on your filesystem, you are doing random reads of one block at a time. It should at least read multiple disk blocks so that it can use the forward momentum (setting the block size to some multiple of your disk block size would help) and also avoid the number of "left over" copies made at buffer boundaries.
There are probably faster approaches. But it'll not be as fast as reading a file in forward order.
UPDATE:
Ok, so I tried an experiment with a rather silly version that processes around 27G of data by reading the first 10 million lines from wikidata JSON dump and reversing those lines.
Timings on my 2015 Mac Book Pro (with all my dev stuff and many chrome windows open eating memory and some CPU all the time, about 5G of total memory is free, VM size is default with no parameters set at all, not run under debugger):
reading in reverse order: 244,648 ms = 244 secs = 4 min 4 secs
reading in forward order: 77,564 ms = 77 secs = 1 min 17 secs
temp file count: 201
approx char count: 29,483,478,770 (line content not including line endings)
total line count: 10,050,000
The algorithm is to read the original file by lines, buffering 50,000 lines at a time and writing the lines in reverse order to a numbered temp file. Then, after all files are written, they are read forward by lines, in reverse numerical order. Basically this divides the file into reverse-ordered fragments of the original. It could be optimized, because this is the most naive version of the algorithm with no tuning. But it does do what file systems do best: sequential reads and sequential writes with good-sized buffers.
So this is a lot faster than the one you were using, and it could be tuned from here to be more efficient. You could trade CPU for disk I/O size and try using gzipped files as well, maybe with a two-threaded model so that the next buffer is gzipping while the previous one is being processed. Fewer string allocations, checking each file function to make sure nothing extra is going on, making sure there is no double buffering, and more.
The ugly but functional code is:
package com.stackoverflow.reversefile
import java.io.File
import java.util.*
fun main(args: Array<String>) {
    val maxBufferSize = 50000
    val lineBuffer = ArrayList<String>(maxBufferSize)
    val tempFiles = ArrayList<File>()
    val originalFile = File("/data/wikidata/20150629.json")
    val tempFilePrefix = "/data/wikidata/temp/temp"
    val maxLines = 10000000

    var approxCharCount: Long = 0
    var tempFileCount = 0
    var lineCount = 0

    val startTime = System.currentTimeMillis()

    println("Writing reversed partial files...")

    try {
        fun flush() {
            val bufferSize = lineBuffer.size
            if (bufferSize > 0) {
                lineCount += bufferSize
                tempFileCount++
                File("$tempFilePrefix-$tempFileCount").apply {
                    bufferedWriter().use { writer ->
                        ((bufferSize - 1) downTo 0).forEach { idx ->
                            writer.write(lineBuffer[idx])
                            writer.newLine()
                        }
                    }
                    tempFiles.add(this)
                }
                lineBuffer.clear()
            }
            println(" flushed at $lineCount lines")
        }

        // read and break into backward sorted chunks
        originalFile.bufferedReader(bufferSize = 4096 * 32)
            .lineSequence()
            .takeWhile { lineCount <= maxLines }.forEach { line ->
                lineBuffer.add(line)
                if (lineBuffer.size >= maxBufferSize) flush()
            }
        flush()

        // read backward sorted chunks backwards
        println("Reading reversed lines ...")
        tempFiles.reversed().forEach { tempFile ->
            tempFile.bufferedReader(bufferSize = 4096 * 32).lineSequence()
                .forEach { line ->
                    approxCharCount += line.length
                    // a line has been read here
                }
            println(" file $tempFile current char total $approxCharCount")
        }
    } finally {
        tempFiles.forEach { it.delete() }
    }

    val elapsed = System.currentTimeMillis() - startTime

    println("temp file count: $tempFileCount")
    println("approx char count: $approxCharCount")
    println("total line count: $lineCount")
    println()
    println("Elapsed: ${elapsed}ms ${elapsed / 1000}secs ${elapsed / 1000 / 60}min ")

    println("reading original file again:")
    val againStartTime = System.currentTimeMillis()
    var againLineCount = 0
    originalFile.bufferedReader(bufferSize = 4096 * 32)
        .lineSequence()
        .takeWhile { againLineCount <= maxLines }
        .forEach { againLineCount++ }
    val againElapsed = System.currentTimeMillis() - againStartTime
    println("Elapsed: ${againElapsed}ms ${againElapsed / 1000}secs ${againElapsed / 1000 / 60}min ")
}
The correct way to investigate this problem would be:
Write a version of this test in pure Java.
Benchmark it to make sure that the performance problem is still there.
Profile it to figure out where the performance bottleneck is.
Q: Am I using ReversedLinesFileReader in the correct way?
Yes. (Assuming that it is an appropriate thing to use a line reader at all. That depends on what it is you are really trying to do. For instance, if you just wanted to count lines backwards, then you should be reading 1 character at a time and counting the newline sequences.)
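As an illustration of that aside, here is a rough Java sketch (Java because ReversedLinesFileReader is a Java library) that counts lines by scanning the file backwards in large blocks with RandomAccessFile rather than literally one character at a time; the file name and block size are placeholders. Counting '\n' bytes is safe for UTF-8 because that byte never appears inside a multi-byte sequence:

import java.io.IOException;
import java.io.RandomAccessFile;

public class BackwardNewlineCounter {
    public static void main(String[] args) throws IOException {
        long newlines = 0;
        try (RandomAccessFile raf = new RandomAccessFile("very-large-file.xml", "r")) {
            byte[] buffer = new byte[1 << 20];          // 1 MiB blocks
            long pos = raf.length();
            while (pos > 0) {
                int toRead = (int) Math.min(buffer.length, pos);
                pos -= toRead;
                raf.seek(pos);
                raf.readFully(buffer, 0, toRead);
                for (int i = toRead - 1; i >= 0; i--) { // scan each block back to front
                    if (buffer[i] == '\n') newlines++;
                }
            }
        }
        System.out.println("lines (by newline count): " + newlines);
    }
}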
Q: I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
Possibly. Reading a file in reverse means that the read-ahead strategies used by the OS to give fast I/O may not work. It could be interacting with the characteristics of an SSD.
Q: Is it just the case that files should not be read from the end to the start?
Possibly. See above.
The other thing that you have not considered is that your file could actually contain some extremely long lines. The bottleneck could be the assembly of the characters into (long) lines.
Looking at the source code, it would seem that there is potential for O(N^2) behavior when lines are very long. The critical part is (I think) in the way that "rollover" is handled by FilePart. Note the way that the "left over" data gets copied.
We have to analyze log files using Hadoop, as it can handle large data easily. So I wrote a MapReduce program, but even my MapReduce program is taking a lot of time to get the data.
String keys[] = value.toString().split(" ");
int keysLength = keys.length;
if (keysLength > 4 && StringUtils.isNumeric(keys[keysLength - 5])) {
    this.keyWords.set(keys[0] + "-" + keys[1] + " " + keys[2] + " " + keys[keysLength - 5] + " " + keys[keysLength - 2]);
    context.write(new IntWritable(1), keyWords);
}
The requirement is: we will have mostly 10 to 15 .gz files, and every .gz file has one log file inside. We have to pull the data from that log file to analyze it.
Sample input in the log file:
2015-09-12 03:39:45.201 [service_client] [anhgv-63ac7ca63ac] [[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'] INFO TempServerImplementation - || Server: loclhost 121.1.0.0 | Service Category: Testing | Service Method: add | Application Id: Test | Status Code: 200 | Duration: 594ms ||
So could someone help me with how I can tune the performance?
Thanks
Sai
You can try using Spark (think of it as in-memory MapReduce); it can be 10x to 100x faster than traditional MapReduce. Please check the trade-offs between Hadoop MapReduce and Spark before using it.
There are two main ways you can speed up your job, input size and variable initialisation.
Input Size
gz is not a splittable format. That means that if you have 15 input gz files, you will only have 15 mappers. I can see from the comments that each gz file is 50MB, so at a generous 10:1 compression ratio, each mapper would be processing 500MB. This can take time, and unless you've got a <15 node cluster, you'll have nodes that are doing nothing. By uncompressing the data before the MR job you could have more mappers which would reduce the runtime.
Variable Initialisation
In the below line:
context.write(new IntWritable(1), keyWords);
you're generating a big overhead by allocating a brand new IntWritable for each output. Instead, why not allocate it at the top of the class? It doesn't change, so it doesn't need allocating each time.
For example:
private static final IntWritable ONE_WRITABLE = new IntWritable(1);
...
context.write(ONE_WRITABLE, keyWords);
The same applies to the strings you use (" " and "-"): assign them as static variables as well, and again avoid creating fresh ones each time.
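Putting both suggestions together, a sketch of what the mapper might look like. This assumes a LongWritable/Text input signature and that keyWords is a reused field of your mapper class (as in your snippet); adjust to your actual class and imports (org.apache.hadoop.io.*, org.apache.commons.lang.StringUtils):

private static final IntWritable ONE_WRITABLE = new IntWritable(1);
private static final String SPACE = " ";
private static final String DASH = "-";
private final Text keyWords = new Text();   // reuse the Text object as well

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] keys = value.toString().split(SPACE);
    int keysLength = keys.length;
    if (keysLength > 4 && StringUtils.isNumeric(keys[keysLength - 5])) {
        keyWords.set(keys[0] + DASH + keys[1] + SPACE + keys[2] + SPACE
                + keys[keysLength - 5] + SPACE + keys[keysLength - 2]);
        context.write(ONE_WRITABLE, keyWords);
    }
}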
(Sometimes our host is wrong; nanoseconds matter ;)
I have a Python Twisted server that talks to some Java servers and profiling shows spending ~30% of its runtime in the JSON encoder/decoder; its job is handling thousands of messages per second.
This talk by the YouTube team raises interesting, applicable points:
Serialization formats: no matter which one you use, they are all expensive. Measure. Don't use pickle. Not a good choice. Found protocol buffers slow. They wrote their own BSON implementation which is 10-15 times faster than the one you can download.
You have to measure. Vitess swapped out one of its protocols for an HTTP implementation. Even though it was in C it was slow. So they ripped out HTTP and did a direct socket call using Python and that was 8% cheaper on global CPU. The enveloping for HTTP is really expensive.
Measurement. In Python measurement is like reading tea leaves. There are a lot of things in Python that are counter-intuitive, like the cost of garbage collection. Most chunks of their apps spend their time serializing. Profiling serialization depends very much on what you are putting in. Serializing ints is very different than serializing big blobs.
Anyway, I control both the Python and Java ends of my message-passing API and can pick a different serialisation than JSON.
My messages look like:
a variable number of longs; anywhere between 1 and 10K of them
and two already-UTF8 text strings; both between 1 and 3KB
Because I am reading them from a socket, I want libraries that can cope gracefully with streams; it's irritating if a library doesn't tell me how much of a buffer it consumed, for example.
The other end of this stream is a Java server, of course; I don't want to pick something that is great for the Python end but moves problems to the Java end, e.g. poor performance or a torturous or flaky API.
I will obviously be doing my own profiling. I ask here in the hope that you describe approaches I wouldn't think of, e.g. using struct, and what the fastest kind of strings/buffers are.
Some simple test code gives surprising results:
import time, random, struct, json, sys, pickle, cPickle, marshal, array

def encode_json_1(*args):
    return json.dumps(args)

def encode_json_2(longs,str1,str2):
    return json.dumps({"longs":longs,"str1":str1,"str2":str2})

def encode_pickle(*args):
    return pickle.dumps(args)

def encode_cPickle(*args):
    return cPickle.dumps(args)

def encode_marshal(*args):
    return marshal.dumps(args)

def encode_struct_1(longs,str1,str2):
    return struct.pack(">iii%dq"%len(longs),len(longs),len(str1),len(str2),*longs)+str1+str2

def decode_struct_1(s):
    i, j, k = struct.unpack(">iii",s[:12])
    assert len(s) == 3*4 + 8*i + j + k, (len(s),3*4 + 8*i + j + k)
    longs = struct.unpack(">%dq"%i,s[12:12+i*8])
    str1 = s[12+i*8:12+i*8+j]
    str2 = s[12+i*8+j:]
    return (longs,str1,str2)

struct_header_2 = struct.Struct(">iii")

def encode_struct_2(longs,str1,str2):
    return "".join((
        struct_header_2.pack(len(longs),len(str1),len(str2)),
        array.array("L",longs).tostring(),
        str1,
        str2))

def decode_struct_2(s):
    i, j, k = struct_header_2.unpack(s[:12])
    assert len(s) == 3*4 + 8*i + j + k, (len(s),3*4 + 8*i + j + k)
    longs = array.array("L")
    longs.fromstring(s[12:12+i*8])
    str1 = s[12+i*8:12+i*8+j]
    str2 = s[12+i*8+j:]
    return (longs,str1,str2)

def encode_ujson(*args):
    return ujson.dumps(args)

def encode_msgpack(*args):
    return msgpacker.pack(args)

def decode_msgpack(s):
    msgunpacker.feed(s)
    return msgunpacker.unpack()

def encode_bson(longs,str1,str2):
    return bson.dumps({"longs":longs,"str1":str1,"str2":str2})

def from_dict(d):
    return [d["longs"],d["str1"],d["str2"]]

tests = [ #(encode,decode,massage_for_check)
    (encode_struct_1,decode_struct_1,None),
    (encode_struct_2,decode_struct_2,None),
    (encode_json_1,json.loads,None),
    (encode_json_2,json.loads,from_dict),
    (encode_pickle,pickle.loads,None),
    (encode_cPickle,cPickle.loads,None),
    (encode_marshal,marshal.loads,None)]

try:
    import ujson
    tests.append((encode_ujson,ujson.loads,None))
except ImportError:
    print "no ujson support installed"

try:
    import msgpack
    msgpacker = msgpack.Packer()
    msgunpacker = msgpack.Unpacker()
    tests.append((encode_msgpack,decode_msgpack,None))
except ImportError:
    print "no msgpack support installed"

try:
    import bson
    tests.append((encode_bson,bson.loads,from_dict))
except ImportError:
    print "no BSON support installed"

longs = [i for i in xrange(10000)]
str1 = "1"*5000
str2 = "2"*5000
random.seed(1)
encode_data = [[
    longs[:random.randint(2,len(longs))],
    str1[:random.randint(2,len(str1))],
    str2[:random.randint(2,len(str2))]] for i in xrange(1000)]

for encoder,decoder,massage_before_check in tests:
    # do the encoding
    start = time.time()
    encoded = [encoder(i,j,k) for i,j,k in encode_data]
    encoding = time.time()
    print encoder.__name__, "encoding took %0.4f,"%(encoding-start),
    sys.stdout.flush()
    # do the decoding
    decoded = [decoder(e) for e in encoded]
    decoding = time.time()
    print "decoding %0.4f"%(decoding-encoding)
    sys.stdout.flush()
    # check it
    if massage_before_check:
        decoded = [massage_before_check(d) for d in decoded]
    for i,((longs_a,str1_a,str2_a),(longs_b,str1_b,str2_b)) in enumerate(zip(encode_data,decoded)):
        assert longs_a == list(longs_b), (i,longs_a,longs_b)
        assert str1_a == str1_b, (i,str1_a,str1_b)
        assert str2_a == str2_b, (i,str2_a,str2_b)
gives:
encode_struct_1 encoding took 0.4486, decoding 0.3313
encode_struct_2 encoding took 0.3202, decoding 0.1082
encode_json_1 encoding took 0.6333, decoding 0.6718
encode_json_2 encoding took 0.5740, decoding 0.8362
encode_pickle encoding took 8.1587, decoding 9.5980
encode_cPickle encoding took 1.1246, decoding 1.4436
encode_marshal encoding took 0.1144, decoding 0.3541
encode_ujson encoding took 0.2768, decoding 0.4773
encode_msgpack encoding took 0.1386, decoding 0.2374
encode_bson encoding took 55.5861, decoding 29.3953
bson, msgpack and ujson all installed via easy_install
I would love to be shown I'm doing it wrong; that I should be using cStringIO interfaces or however else you speed it all up!
There must be a way to serialise this data that is an order of magnitude faster surely?
While JSON is flexible, it is one of the slowest serialization formats in Java (and possibly Python as well). If nanoseconds matter, I would use a binary format in native byte order (likely to be little-endian).
Here is a library where I do exactly that: AbstractExcerpt and UnsafeExcerpt. A typical message takes 50 to 200 ns to serialize and send, or to read and deserialize.
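For illustration only, a hand-rolled sketch of what a native-byte-order layout could look like on the Java side, mirroring the [counts, longs, UTF-8 bytes] layout the question's struct variants use. This is not that library's API, just a plain ByteBuffer version; the class and method names are made up:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class NativeCodec {
    // Layout: [int longCount][int str1Len][int str2Len][longs...][str1 bytes][str2 bytes]
    public static ByteBuffer encode(long[] longs, byte[] str1Utf8, byte[] str2Utf8) {
        ByteBuffer buf = ByteBuffer
                .allocate(12 + 8 * longs.length + str1Utf8.length + str2Utf8.length)
                .order(ByteOrder.nativeOrder());
        buf.putInt(longs.length).putInt(str1Utf8.length).putInt(str2Utf8.length);
        for (long l : longs) buf.putLong(l);
        buf.put(str1Utf8).put(str2Utf8);
        buf.flip();
        return buf;
    }

    public static Object[] decode(ByteBuffer buf) {
        buf.order(ByteOrder.nativeOrder());
        int n = buf.getInt(), len1 = buf.getInt(), len2 = buf.getInt();
        long[] longs = new long[n];
        for (int i = 0; i < n; i++) longs[i] = buf.getLong();
        byte[] str1 = new byte[len1];
        byte[] str2 = new byte[len2];
        buf.get(str1).get(str2);
        return new Object[]{longs, str1, str2};
    }
}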
In the end, we chose to use msgpack.
If you go with JSON, your choice of library on Python and Java is critical to performance:
On Java, http://blog.juicehub.com/2012/11/20/benchmarking-web-frameworks-for-games/ says:
Performance was absolutely atrocious until we swapped out the JSON lib (json-simple) for Jackson's ObjectMapper. This brought RPS from 35 to 300+, a 10x increase.
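On the Java side, a minimal Jackson sketch; the Message class is a placeholder for whatever POJO you map your payload to, and the key point is to create the ObjectMapper once and reuse it rather than per request:

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonCodec {
    // Create once and reuse; ObjectMapper is thread-safe once configured.
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static byte[] encode(Message message) throws java.io.IOException {
        return MAPPER.writeValueAsBytes(message);            // Message -> JSON bytes
    }

    public static Message decode(byte[] json) throws java.io.IOException {
        return MAPPER.readValue(json, Message.class);        // JSON bytes -> Message
    }
}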
You may be able to speed up the struct case
def encode_struct(longs,str1,str2):
    return struct.pack(">iii%dq"%len(longs),len(longs),len(str1),len(str2),*longs)+str1+str2
Try using the python array module and the method tostring to convert your longs into a binary string. Then you can append it like you did with the strings
Create a struct.Struct object and use that. I believe it's more efficient
You can also look into:
http://docs.python.org/library/xdrlib.html#module-xdrlib
Your fastest method encodes 1000 elements in .1222 seconds. That's 1 element in .1222 milliseconds. That's pretty fast. I doubt you'll do much better without switching languages.
I know this is an old question, but it is still interesting. My most recent choice was to use Cap'n Proto, written by the same guy who did protobuf for Google. In my case, that led to a decrease in both time and volume of around 20% compared to Jackson's JSON encoder/decoder (server to server, Java on both sides).
Protocol Buffers are pretty fast and have bindings for both Java and Python. It's quite a popular library and used inside Google so it should be tested and optimized quite well.
Since the data you're sending is already well defined, non-recursive, and non-nested, why not just use a simple delimited string? You just need a delimiter that isn't contained in your string variables, maybe '\n'.
"10\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\nFoo Foo Foo\nBar Bar Bar"
Then just use a simple String Split method.
String[] temp = str.split("\n");
ArrayList<Long> longs = new ArrayList<Long>(Integer.parseInt(temp[0]));
String string1 = temp[temp.length - 2];
String string2 = temp[temp.length - 1];
for (int i = 1; i < temp.length - 2; i++)
    longs.add(Long.parseLong(temp[i]));
Note: the above was written in the web browser and is untested, so syntax errors may exist.
For a text-based approach, I'd assume the above is the fastest method.