I am new to machine learning and will be working on a project that uses a machine learning library to detect and alert about possible anomalies. I will be using Apache Spark, and I decided to use the KMeans method for the project.
The project consists of analyzing daily files, detecting fluctuating changes in some of the records, and reporting them as possible anomalies (if the model considers them one). The files are generated at the end of each day, and my program needs to check them the next morning for anomalies. However, I need to check for anomalies file vs. file, NOT within a single file. That is, I have to compare the data of every file and see whether it fits the model I would build with the chosen algorithm. What I'm trying to say is that I have some valid data that I will train my model on. Then I have to apply this same model to other files of the same format but, obviously, with different data. I'm not looking for a prediction column, but rather for detecting anomalies in these other files. If there is an anomaly, the program should tell me which row/column has it, and then I have to make it send an email saying that there is a possible anomaly in that specific file.
As I said, I am new to machine learning. I want to know how I can use the KMeans algorithm to detect outliers/anomalies in a file.
So far I have created the model:
SparkConf conf = new SparkConf().setAppName("practice").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession
        .builder()
        .appName("Anomaly Detection")
        .getOrCreate();

String day1txt = "C:\\Users\\User\\Documents\\day1.txt";
String day2txt = "C:\\Users\\User\\Documents\\day2.txt";

// Read the tab-delimited file with a header row.
Dataset<Row> day1 = spark.read()
        .option("header", "true")
        .option("delimiter", "\t")
        .option("inferSchema", "true")
        .csv(day1txt);

// Make sure the numeric columns are integers.
day1 = day1.withColumn("Size", day1.col("Size").cast("Integer"));
day1 = day1.withColumn("Records", day1.col("Records").cast("Integer"));

// Combine the numeric columns into a single "features" vector column.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"Size", "Records"})
        .setOutputCol("features");
Dataset<Row> day1vector = assembler.transform(day1);

// Train KMeans on the known-good day 1 data.
KMeans kmeans = new KMeans().setK(5).setSeed(1L);
KMeansModel model = kmeans.fit(day1vector);
I don't know what to do from this point on to detect outliers. I have several other .txt files that should contain "normal" data, and I also have a couple of files with "tampered/abnormal" data. Do I need to train my model with all the test data I have available, and if so, how can I train one model on several datasets? Or should I train it with just one dataset and test it against the others?
EDIT:
This is a sample of the file (day1.txt) I will be using (dummy data, of course; top 10 rows):
Name Size Records
File1 1000 104370
File2 990 101200
File3 1500 109123
File4 2170 113888
File5 2000 111974
File6 1820 110666
File7 1200 106771
File8 1500 108991
File9 1000 104007
File10 1300 107037
This is considered normal data, and I will have different files with the same format but values in roughly the same range. Then I have some files where I purposely added an outlier, like Size: 1000, Records: 50000.
How can I detect that with KMeans? Or, if KMeans is not the right model, which model should I use and how should I go about it?
There is a simple approach for this: create your clusters with KMeans, then, for each cluster, choose a reasonable radius with respect to that cluster's center; if a point lies outside that radius, it is an outlier.
Try looking at this: https://arxiv.org/pdf/1402.6859.pdf
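For illustration, here is a minimal sketch of that radius idea in Spark's Java API, continuing from the code in the question. It assumes a second file has been read into day2 and run through the same VectorAssembler, and maxDistance is a hypothetical threshold that you would tune on your known-good files (for example, the largest center distance seen in the training data plus some margin):

// Additional imports assumed: org.apache.spark.ml.linalg.Vector,
// org.apache.spark.ml.linalg.Vectors, java.util.ArrayList, java.util.List

// Score a new day's file with the model trained on the known-good data.
Dataset<Row> day2vector = assembler.transform(day2);
Dataset<Row> scored = model.transform(day2vector); // adds a "prediction" (cluster index) column

Vector[] centers = model.clusterCenters();
double maxDistance = 3000.0; // hypothetical threshold, tune it on normal files

List<Row> anomalies = new ArrayList<>();
for (Row row : scored.collectAsList()) { // daily files are small, so collecting to the driver is fine
    Vector features = row.getAs("features");
    int cluster = row.getInt(row.fieldIndex("prediction"));
    double distance = Math.sqrt(Vectors.sqdist(features, centers[cluster]));
    if (distance > maxDistance) {
        anomalies.add(row); // the Row still carries Name, Size and Records for your report
    }
}

for (Row r : anomalies) {
    System.out.println("Possible anomaly: " + r); // replace with your email alert
}

As for training: you would normally fit the model once on the known-good files only (several Datasets with the same schema can be combined with union) and then just transform each new daily file against that fixed model; the tampered files are for testing, not training.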
There are also dedicated outlier detection techniques, such as One-Class SVM or Angle-Based Outlier Detection (ABOD). Try looking at this: http://scikit-learn.org/stable/modules/outlier_detection.html
I have an S3 bucket with terabytes of data, split into small files of less than 5 MB each.
I am trying to use Flink to process them.
I create the source with the following code:
var inputFormat = new TextInputFormat(null);
inputFormat.setNestedFileEnumeration(true);
return streamExecutionEnvironment.readFile(inputFormat, "s3://name/");
But memory usage keeps growing until it hits the limit, the job is killed, and it is not scheduled again, with the error:
Could not fulfill resource requirements of job
No data reaches the sink.
On a small set of data it works fine.
How can I read the files without using too much memory?
Thanks.
The same behaviour occurs with:
env.fromSource(
        FileSource.forRecordStreamFormat(
                        new TextLineFormat(),
                        new Path("s3://name/"))
                .monitorContinuously(Duration.ofMillis(10000L))
                .build(),
        WatermarkStrategy.noWatermarks(),
        "MySourceName")
The FileSource is the preferred way to ingest data from files. It should be able to handle the sort of scale you are talking about.
docs
javadocs
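For what it's worth, here is a minimal sketch of a bounded FileSource read (this assumes a one-shot batch pass over the bucket is acceptable; without monitorContinuously() the source is bounded and the job finishes once all files are read). Class names assume a recent Flink release, where the line format is called TextLineInputFormat (older releases call it TextLineFormat, as in your snippet):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class S3BatchRead {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded source: no monitorContinuously(), so the job ends after one pass over the files.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://name/"))
                .build();

        DataStream<String> lines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-files");

        lines.print(); // replace with your real processing and sink
        env.execute("read-s3");
    }
}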
Calling setQueueLimit on the Kinesis producer solved my problem: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/kinesis/#backpressure
Are there any settings I can use to make the output go into a separate timestamped directory (in a given format) every time I run the job?
I use the following Scalding code to write my flow output:
val out = TypedPipe[MyType]
out.write(PackedAvroSource[MyType]("my/output/path"))
By default Scalding replaces the output in the my/output/path directory in HDFS. I'd like the output to go into a different my/output/path/MMDDyyyyHHmm/ path depending on when the job runs. I am about to write some utils to add a timestamp to the path myself, but I'd rather use existing ones if available.
Try concatenating a formatted timestamp to the output directory, for example:
val timestamp = new java.text.SimpleDateFormat("MMddyyyyHHmm").format(new java.util.Date())
val direct = "my/output/path/" + timestamp
out.write(PackedAvroSource[MyType](direct))
For more information on dates and formatting, see the java.text.SimpleDateFormat documentation.
You can use a PartitionedDelimited sink to write to multiple directories. See the comments in https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/PartitionedDelimitedSource.scala for more info.
This would preclude you from using the Avro format, but perhaps you could write a PartitionedPackedAvro?
How do you merge two .odt files? Doing that by hand, opening each file and copying the content, would work, but is infeasible.
I have tried the ODF Toolkit Simple API (simple-odf-0.8.1-incubating) to achieve that task, creating an empty TextDocument and merging everything into it:
private File masterFile = new File(...);
...
TextDocument t = TextDocument.newTextDocument();
t.save(masterFile);
...
for (File f : filesToMerge) {
    joinOdt(f);
}
...
void joinOdt(File joinee) {
    TextDocument master = (TextDocument) TextDocument.loadDocument(masterFile);
    TextDocument slave = (TextDocument) TextDocument.loadDocument(joinee);
    master.insertContentFromDocumentAfter(slave, master.getParagraphByReverseIndex(0, false), true);
    master.save(masterFile);
}
And that works reasonably well; however, it loses information about fonts: the original files are a combination of Arial Narrow and Wingdings (for check boxes), while the output masterFile is all in Times New Roman. At first I suspected the last parameter of insertContentFromDocumentAfter, but changing it to false breaks (almost) all formatting. Am I doing something wrong? Is there any other way?
I think this is "works as designed".
I tried this once with a global document, which imports documents and displays them as-is... as long as the paragraph styles have different names!
Styles with the same name are overwritten with the values from the "master" document.
So I ended up cloning standard styles with unique (per document) names.
HTH
My case was a rather simple one: the files I wanted to merge were generated the same way and used the same basic formatting. Therefore, starting off from one of my files instead of from an empty document fixed my problem.
However, this question will remain open until someone comes up with a more general solution to formatting retention (possibly based on ngulam's answer and comments?).
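For completeness, a minimal sketch of that workaround, reusing only the API calls already shown in the question and assuming filesToMerge is a List<File> (its type is not shown above):

// Seed the master from the first input file instead of an empty document,
// so its styles and fonts are carried over, then merge the rest into it.
File first = filesToMerge.get(0);
TextDocument master = (TextDocument) TextDocument.loadDocument(first);
master.save(masterFile);

for (File f : filesToMerge.subList(1, filesToMerge.size())) {
    joinOdt(f); // same helper as in the question
}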
I have the following feed from my vendor:
http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule
I want to get the data from that XML feed as Java objects, so that I can insert it into my database regularly.
The data is just regular updates from the vendor, which I use to update my website.
Can you please suggest what options are available to get this working?
Should I use some web service or just XStream to get my final output? Please advise, as I am a newcomer to this concept.
The vendor has said he can give me the data in any of 3 formats: RSS, XML, or JSON. I am not sure which is the easiest and least resource-consuming to get working.
I would suggest just writing a program that parses the XML and inserts the data directly into your database.
Example
This Groovy script inserts data into an H2 database.
//
// Dependencies
// ============
// The Grape annotations must use '@' and annotate the import so the H2 driver is
// resolved before the script runs.
@Grapes([
    @Grab(group='com.h2database', module='h2', version='1.3.163'),
    @GrabConfig(systemClassLoader=true)
])
import groovy.sql.Sql

//
// Main program
// ============
def sql = Sql.newInstance("jdbc:h2:db/cricket", "user", "pass", "org.h2.Driver")
def dataUrl = new URL("http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule")

dataUrl.withReader { reader ->
    def feeds = new XmlSlurper().parse(reader)

    feeds.matches.match.each {
        def data = [
            it.id,
            it.name,
            it.type,
            it.tournamentId,
            it.location,
            it.date,
            it.GMTTime,
            it.localTime,
            it.description,
            it.team1,
            it.team2,
            it.teamId1,
            it.teamId2,
            it.tournamentName,
            it.logo
        ].collect {
            it.text()
        }

        sql.execute("INSERT INTO matches (id,name,type,tournamentId,location,date,GMTTime,localTime,description,team1,team2,teamId1,teamId2,tournamentName,logo) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", data)
    }
}
Well... you could use an XML parser (stream or DOM), or a JSON parser (again, stream or 'DOM'), and build the objects on the fly. But with this data, which seems to consist of records of cricket matches, why not go with a CSV format?
This seems to be your basic 'datum':
<id>1263</id>
<name>Australia v India 3rd Test at Perth - Jan 13-17, 2012</name>
<type>TestMatch</type>
<tournamentId>137</tournamentId>
<location>Perth</location>
<date>2012-01-14</date>
<GMTTime>02:30:00</GMTTime>
<localTime>10:30:00</localTime>
<description>3rd Test day 2</description>
<team1>Australia</team1>
<team2>India</team2>
<teamId1>7</teamId1>
<teamId2>1</teamId2>
<tournamentName>India tour of Australia 2011-12</tournamentName>
<logo>/cricket/137/tournament.png</logo>
Of course you would still have to parse the CSV and deal with delimiter escaping (such as when you have a ' or a " in a string), but it will reduce your network traffic quite substantially and likely parse much faster on the client. Of course, this depends on what your client is.
Actually, you have a RESTful store that can return data in several formats, and you only need to read from this source; no further interaction is needed.
So you can use any XML parser to parse the XML data and put the extracted data into whatever data structure you want or already have.
I had not heard of XStream before, but you can find more information about selecting the best parser for your situation at this Stack Overflow question.
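If it helps, here is a rough sketch of doing that with the JDK's built-in DOM parser, using the element names from the sample above; the Match holder class and the text() helper are illustrative names of mine, and the database insert is left as a comment:

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FeedReader {

    // Plain holder for one <match> element; add the remaining fields as needed.
    static class Match {
        String id;
        String name;
        String team1;
        String team2;
    }

    public static void main(String[] args) throws Exception {
        String feedUrl = "http://scores.cricandcric.com/cricket/getFeed?key=4333433434343"
                + "&format=xml&tagsformat=long&type=schedule";

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new URL(feedUrl).openStream());

        List<Match> matches = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("match");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            Match m = new Match();
            m.id = text(e, "id");
            m.name = text(e, "name");
            m.team1 = text(e, "team1");
            m.team2 = text(e, "team2");
            matches.add(m);
            // insert the match into your database here, e.g. with a JDBC PreparedStatement
        }
        System.out.println("Parsed " + matches.size() + " matches");
    }

    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() > 0 ? list.item(0).getTextContent() : null;
    }
}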
How long does sorting a 100 MB XML file with Java take?
The file has items with the following structure, and I need to sort them by event:
<doc>
<id>84141123</id>
<title>kk+ at Hippie Camp</title>
<description>photo by SFP</description>
<time>18945840</time>
<tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
<geo></geo>
<event>47409</event>
</doc>
I'm on an Intel Core 2 Duo with 4 GB of RAM.
Minutes? Hours?
Thanks.
Here are the timings for a similar task executed using Saxon XQuery on a 100 MB input file.
Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816
So: about 6 seconds for parsing the input file and building a tree, 3.5 seconds for sorting it. That's invoked from the command line, but invoking it from Java will get very similar performance. Don't try to code the sort yourself - it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.
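For reference, a rough sketch of running such a query from Java through Saxon's s9api; the query text, the file names, and the <docs> wrapper element are my own assumptions adapted to the structure shown in the question:

import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.XQueryCompiler;
import net.sf.saxon.s9api.XQueryEvaluator;
import net.sf.saxon.s9api.XQueryExecutable;

public class SortWithSaxon {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(false); // false = non-schema-aware (Saxon-HE)
        XQueryCompiler compiler = processor.newXQueryCompiler();

        // Sort the <doc> elements by their numeric <event> child and wrap them
        // in a <docs> root so the output stays well-formed.
        XQueryExecutable query = compiler.compile(
                "<docs>{ for $d in /*/doc order by xs:integer($d/event) return $d }</docs>");

        XQueryEvaluator evaluator = query.load();
        evaluator.setSource(new StreamSource(new File("input.xml")));
        evaluator.setDestination(processor.newSerializer(new File("sorted.xml")));
        evaluator.run();
    }
}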
I would say minutes. You should be able to do that completely in memory, so with a SAX parser it would be read, sort, write; that should not be a problem for your hardware.
I think a problem like this would be better handled using serialisation:
Deserialise the XML file into an ArrayList of 'doc' items.
Using straight Java code, sort on the event element and store the sorted ArrayList in another variable.
Serialise the sorted 'doc' ArrayList back out to a file.
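Not literal Java serialisation, but here is a rough DOM-based sketch of the same read-sort-write steps, assuming the whole file fits in memory, the <doc> elements are direct children of the root, and every <doc> has a single numeric <event> child; file names are placeholders:

import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SortByEvent {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("input.xml"));
        Element root = doc.getDocumentElement();

        // Collect all <doc> elements into a list we can sort.
        NodeList nodes = doc.getElementsByTagName("doc");
        List<Element> docs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            docs.add((Element) nodes.item(i));
        }

        // Sort by the numeric content of the <event> child.
        docs.sort(Comparator.comparingLong((Element e) ->
                Long.parseLong(e.getElementsByTagName("event").item(0).getTextContent().trim())));

        // Re-appending an existing node moves it, so this reorders the children in place.
        for (Element e : docs) {
            root.appendChild(e);
        }

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new DOMSource(doc), new StreamResult(new File("sorted.xml")));
    }
}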
If you do it in memory, you should be able to do this in under 10 seconds. You would be pushed to do it in under 2 seconds, because it will spend that much time just reading and writing to disk.
This program should use no more than 4-5x the original file size in memory, about 500 MB in your case.
import java.io.File;
import java.util.Map;
import java.util.TreeMap;
import org.apache.commons.io.FileUtils;

// Split on the <doc>/</doc> tags: even indices are the text between records,
// odd indices are the record bodies themselves.
String[] records = FileUtils.readFileToString(new File("my-file.xml")).split("</?doc>");

// TreeMap keeps the records sorted by the numeric key.
// This sorts by <id>; to sort by <event> (as asked), use the <event>/</event> tags
// instead and make sure the keys are unique (or collect duplicates into a list).
Map<Long, String> recordMap = new TreeMap<Long, String>();
for (int i = 1; i < records.length; i += 2) {
    String record = records[i];
    int pos1 = record.indexOf("<id>");
    int pos2 = record.indexOf("</id>", pos1 + 4);
    long num = Long.parseLong(record.substring(pos1 + 4, pos2));
    recordMap.put(num, record);
}

// Rebuild the document: leading text, then the records in sorted order, then the trailing text.
StringBuilder sb = new StringBuilder(records[0]);
for (String s : recordMap.values()) {
    sb.append("<doc>").append(s).append("</doc>");
}
sb.append(records[records.length - 1]);
FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString());