I am new to the world of ML and data mining, and I am looking for help and guidance on finding unusual behavior in my log file.
Assume I have a CSV file that logs each user session's start time and end time, plus the policy numbers the user worked on, similar to the sample below.
Start_date, username, end_date, Policy_numbers
2018-01-02 10:01, user1, 2018-01-02 10:10, PO-123
2018-01-02 10:05, user2, 2018-01-02 10:20, PO-456
2018-01-02 10:11, user1, 2018-01-02 10:45, PO-789 | PO-999 (| is the delimiter here)
Is there any Python or Java library/module/code, or an open-source application, that can identify patterns such as: most users logged in between 10 AM and 5 PM, the average number of sessions per day in a month, the average length of a session, and so on?
I expect the application to recognize various patterns and present them to me as a list, so that I can pick the ones that matter to the business.
(If I already know the pattern, then I can find the answers with a few queries and have no need for pattern recognition; that would be an easy job.)
Then, is there a way to train the system on these recognized patterns to find unusual behaviors, such as users who logged in after 5:00 PM, sessions that took far longer than average, and so on?
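For example, for the "sessions that took far longer than average" case, here is a minimal Java sketch of what I mean (the file name sessions.csv and the mean-plus-two-standard-deviations threshold are just assumptions on my part):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class SessionOutliers {
    public static void main(String[] args) throws Exception {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");
        List<String> lines = Files.readAllLines(Paths.get("sessions.csv"));
        List<String> users = new ArrayList<>();
        List<Long> minutes = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) { // skip the header row
            String[] f = line.split("\\s*,\\s*"); // Start_date, username, end_date, Policy_numbers
            LocalDateTime start = LocalDateTime.parse(f[0], fmt);
            LocalDateTime end = LocalDateTime.parse(f[2], fmt);
            users.add(f[1]);
            minutes.add(Duration.between(start, end).toMinutes());
        }
        // mean and variance of session length
        double mean = 0, var = 0;
        for (long m : minutes) mean += m;
        mean /= minutes.size();
        for (long m : minutes) var += (m - mean) * (m - mean);
        var /= minutes.size();
        double threshold = mean + 2 * Math.sqrt(var); // mean + 2 standard deviations
        for (int i = 0; i < minutes.size(); i++) {
            if (minutes.get(i) > threshold) {
                System.out.printf("unusual session: user=%s, %d min (threshold %.1f)%n",
                        users.get(i), minutes.get(i), threshold);
            }
        }
    }
}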
Thanks for any thought.
I'm trying to delete ADLDS user records created by Microsoft's conflict resolution model.
Microsoft describes the creation of the new records as
The new RDN will be <Old RDN>\0ACNF:<objectGUID>
These are the records I'm trying to delete from my environment.
My search for uid=baduser will return two CNs:
cn=John R. Doe 123456
and
cn=John R. Doe 123456 CNF:123e4567-e89b-12d3-a456-426614174000
The second record has the \0A in the cn.
Executing a ctx.destroySubcontext(cn) on it errors out like this:
cn=John R. Doe 123456CNF:123e4567-e89b-12d3-a456-426614174000,c=US: [LDAP: error code 34 - 0000208F: NameErr: DSID-0310022D, problem 2006 (BAD_NAME), data 8349
What am I missing to be able to delete a record with a cn that contains a line feed character?
note: I also can't seem to read/modify this \0A record using JXplorer. Clicking on the record after a search results in the same BAD_NAME error.
A simple replacement of the \n character worked for me:
String commonName = attr.get("cn").get().toString().replace("\n", "\\\\0A");
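For context, here is a minimal end-to-end sketch (the host, credentials, container DN, and the example cn value are all placeholders I made up). The double backslash is deliberate: JNDI consumes one level of escaping when it parses the name, so the server ends up seeing \0A:

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class DeleteCnfEntry {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://adlds.example.com:389");
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, "cn=admin,cn=Users,dc=example,dc=com");
        env.put(Context.SECURITY_CREDENTIALS, "secret");
        DirContext ctx = new InitialDirContext(env);

        // cn exactly as returned by the search, containing a real '\n' character
        String commonName = "John R. Doe 123456\nCNF:123e4567-e89b-12d3-a456-426614174000";

        // Re-escape the line feed so the server receives "\0A" in the RDN
        String escaped = commonName.replace("\n", "\\\\0A");

        ctx.destroySubcontext("cn=" + escaped + ",cn=Users,dc=example,dc=com");
        ctx.close();
    }
}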
I am using FQL to find events that contain a given word. FQL works only in API versions below 2.1. To display events I use the Graph API Explorer, e.g.:
search?q=york&type=event
Example of FQL
SELECT eid, name, location, start_time, description, pic_small, creator, venue FROM event WHERE start_time > "Sun Jun 21 0:00:35 GMT 2015" AND CONTAINS("york")
I would like to search events using RestFB without FQL, but I do not know how. The documentation is scarce.
I answered this already on GitHub, but perhaps someone else finds this useful.
Your special case is not in the documentation, but you can transfer what the documentation shows to solve your problem: http://restfb.com/#searching
Connection<Event> eventList =
facebookClient.fetchConnection("search", Event.class,
Parameter.with("q", "york"), Parameter.with("type", "event"));
Now, you can iterate over the eventList.
Here you can find how this can be done: http://restfb.com/#fetching-connections
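If it helps, here is a minimal self-contained sketch of the fetch-and-iterate flow (the access token is a placeholder). A Connection pages through the results, so the outer loop is over pages:

import java.util.List;

import com.restfb.Connection;
import com.restfb.DefaultFacebookClient;
import com.restfb.FacebookClient;
import com.restfb.Parameter;
import com.restfb.types.Event;

public class EventSearch {
    public static void main(String[] args) {
        FacebookClient facebookClient = new DefaultFacebookClient("YOUR_ACCESS_TOKEN");
        Connection<Event> eventList = facebookClient.fetchConnection("search",
                Event.class, Parameter.with("q", "york"),
                Parameter.with("type", "event"));

        // Connection iterates page by page; each page is a List<Event>
        for (List<Event> page : eventList) {
            for (Event event : page) {
                System.out.println(event.getName() + " - " + event.getStartTime());
            }
        }
    }
}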
I have a hundred WHOIS files from different top-level domains (.com, .se, .uk, .cz, etc.). Each has a different format. My main task is to extract information such as registrar, registrant, expiry date, updated date, etc. The patterns below work for .com, .net, .org & .info. I am using Java SE 6.
Admin contact: "\\bAdmin\\sEmail:\\s*\\w+\\-*\\w*\\.*\\w*#\\w+(\\.\\w+)+"
Technical contact: "\\bTech\\sEmail:\\s*\\w+\\-*\\w*\\.*\\w*#\\w+(\\.\\w+)+"
Whois Registrant: "\\bRegistrant\\sName:\\s*\\w+\\-*\\.*\\w+\\s*\\w*"
Registrar: "\\bRegistrar:\\w+\\.*\\w*"
Registered on Date: "\\bCreation\\sDate:\\s*\\d+-\\d+-\\d+T\\d+:\\d+:\\d+Z"
Expiry Date: "\\bExpiry\\sDate:\\s*\\d+-\\d+-\\d+T\\d+:\\d+:\\d+Z"
Updated Date: "\\bUpdated\\sDate:\\s*\\d+-\\d+-\\d+T\\d+:\\d+:\\d+Z"
Name Servers: "\\bName\\sServer:\\s*\\w+\\d*\\.*\\w*\\-*\\w*(\\.\\w+)+"
Registrant Status: "\\bDomain\\sStatus:\\s*\\w+"
How do I add alternatives for each of the above patterns for the other TLDs? For example:
I would like to have Name Servers:
"\\bName\\sServer:\\s*\\w+\\d*\\.*\\w*\\-*\\w*(\\.\\w+)+"
OR
alternative pattern
OR
alternative Pattern
Is this doable? If not, is there an alternative way?
Alternative patterns can be concatenated with the | operator:
"\\bName\\sServer:\\s*\\w+\\d*\\.*\\w*\\-*\\w*(\\.\\w+)+|alternative pattern|alternative Pattern"
(If this isn't what you need, then your question should be reformulated.)
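One caveat: | has the lowest precedence of all regex operators, so each alternative must be a complete, self-contained pattern. If your alternatives share the Name Server: label, a non-capturing group (?:...) lets them share a prefix instead. A rough, Java 6-compatible sketch (the nserver and Nameservers variants are made-up placeholders, not real TLD formats):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameServerPatterns {
    public static void main(String[] args) {
        // Top-level alternation: each alternative is a complete pattern
        Pattern full = Pattern.compile(
                "\\bName\\sServer:\\s*\\w+\\d*\\.*\\w*\\-*\\w*(\\.\\w+)+"
                + "|\\bnserver:\\s*\\S+"          // hypothetical alternative
                + "|\\bNameservers?:\\s*\\S+");   // hypothetical alternative

        // Shared prefix, with only the label part alternated
        Pattern grouped = Pattern.compile(
                "\\b(?:Name\\sServer|nserver|Nameservers?):\\s*\\S+");

        Matcher m = grouped.matcher("nserver: ns1.example.net");
        if (m.find()) {
            System.out.println(m.group()); // prints: nserver: ns1.example.net
        }
    }
}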
I feel a bit nervous because this is my first question here at Stack Overflow. Please let me know if I am not doing it in a good manner.
In LDAP, I think the following search filter string works.
( & (uid=tt4cs) (objectClass=inetOrgPerson) )
It means: search for entries where one of the uid values is tt4cs and one of the objectClass values is inetOrgPerson.
Note that there are spaces around the parentheses and the ampersand, which appear to be ignored. But as far as I can tell from RFC 4515, nothing allows whitespace in those positions. Could anybody kindly tell me whether it is allowed by some other standard, or whether accepting it is just a convention?
Update on Jan 13, 2014
I have tested it in three ways. (LDAP server in my environment is OpenLDAP 2.4.38)
(1) Do ldapsearch on command line. The above search filter works and gets a result.
(2) Search by using UnboundID LDAP SDK for Java. This API does not send the search request to the server, but throws an exception that says "Unexpected closing parenthesis found at position 15 of the filter string."
String filter = "( & (uid=tt4cs) (objectClass=inetOrgPerson) )";
SearchResult searchResult
= connection.search("dc=localdomain", SearchScope.SUB, filter);
(3) Search by using Apache Directory LDAP API. This API does not send the search request to the server, but throws an exception that says "The filter ( & (uid=tt4cs) (objectClass=inetOrgPerson) ) is invalid."
String filter = "( & (uid=tt4cs) (objectClass=inetOrgPerson) )";
EntryCursor cursor
= connection.search("dc=localdomain", filter, SearchScope.SUBTREE);
Now I suspect that acceptance of the extra spaces is implementation-dependent behavior, and that it is better to avoid them.
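For what it's worth, building the filter programmatically (at least with the UnboundID SDK) sidesteps the string-parsing question entirely; a minimal sketch (host, port, and base DN are the ones from my test environment):

import com.unboundid.ldap.sdk.Filter;
import com.unboundid.ldap.sdk.LDAPConnection;
import com.unboundid.ldap.sdk.SearchResult;
import com.unboundid.ldap.sdk.SearchScope;

public class FilterObjectSearch {
    public static void main(String[] args) throws Exception {
        // Build the filter as objects instead of parsing a string,
        // so no whitespace question arises at all
        Filter filter = Filter.createANDFilter(
                Filter.createEqualityFilter("uid", "tt4cs"),
                Filter.createEqualityFilter("objectClass", "inetOrgPerson"));

        LDAPConnection connection = new LDAPConnection("localhost", 389);
        SearchResult searchResult =
                connection.search("dc=localdomain", SearchScope.SUB, filter);
        System.out.println(searchResult.getEntryCount() + " entries found");
        connection.close();
    }
}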
I'm using log4j with the %d ... conversion pattern, which makes every log message begin with a timestamp like so: 2011-06-26 14:34:16,357. I log each SQL query I submit.
I would like to analyze deltas between SQL queries, and maybe even aggregate multiple executions of the exact same SQL query to get max time and average time.
How would you approach this? Using grep and some Excel work? Is there a common way/tool/script that would make my life easier?
P.S. To make things more annoying, my SQL statements are multi-line, so the log4jdbc sqltiming logger prints them like this:
2011-06-26 14:43:32,112 [SelectCampaignTask ] INFO : jdbc.sqltiming - CREATE INDEX idx ON tab CRLF
USING btree (id1, id2, emf); {executed in 34788 msec}
I would be tempted to write a Groovy/Perl/Python script to pick apart the logs using a regular expression.
If you dump the output to CSV, you can certainly use Excel to mine the data.
An alternative would be to write the datetime, thread, category level, and log message to a database table. Writing SQL queries against that table is an easy way to generate custom reports with time ranges, LIKE filters, and so on.
Mining log files seems to be a rite of passage for most developers and is often a good time to learn a scripting language...
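In that spirit, here is a rough sketch in Java (the file name, and the assumption that every statement ends with log4jdbc's {executed in N msec} marker, are mine) that aggregates count, average, and max time per statement:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SqlTimingStats {
    public static void main(String[] args) throws Exception {
        Pattern executed = Pattern.compile("\\{executed in (\\d+) msec\\}");
        String marker = "jdbc.sqltiming - ";
        Map<String, long[]> stats = new HashMap<String, long[]>(); // sql -> {count, total, max}
        StringBuilder sql = new StringBuilder();
        BufferedReader in = new BufferedReader(new FileReader("sqltiming.log"));
        String line;
        while ((line = in.readLine()) != null) {
            int h = line.indexOf(marker);
            if (h >= 0) {
                // a new log event starts: the SQL begins after the marker
                sql.setLength(0);
                line = line.substring(h + marker.length());
            }
            Matcher m = executed.matcher(line);
            if (m.find()) {
                // the statement ends here; record its timing
                sql.append(line, 0, m.start());
                long msec = Long.parseLong(m.group(1));
                String key = sql.toString().trim();
                long[] s = stats.get(key);
                if (s == null) {
                    s = new long[3];
                    stats.put(key, s);
                }
                s[0]++;
                s[1] += msec;
                s[2] = Math.max(s[2], msec);
                sql.setLength(0);
            } else {
                sql.append(line).append(' '); // continuation of a multi-line statement
            }
        }
        in.close();
        for (Map.Entry<String, long[]> e : stats.entrySet()) {
            long[] s = e.getValue();
            System.out.printf("count=%d avg=%d max=%d  %s%n",
                    s[0], s[1] / s[0], s[2], e.getKey());
        }
    }
}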
I just solved the same issue by writing a small script in Python. I am a total newbie at Python, and I was able to get it working in less than a couple of hours.
Here are the key parts of my code:
import re

logfile = open("jdbcPerf.log", "r").readlines()

# extract the interesting lines (timestamped lines and '{executed' lines)
selectedLines = []
for line in logfile:
    m = re.search(r'^(\d+)-(\d+)-(\d+)|\{executed', line)
    if m:
        selectedLines.append(line)

# extract name of servlet and execution time
for line in selectedLines:
    # extract servlet name
    m = re.search(r'servlets\.([a-zA-Z]*)\.([a-zA-Z]*)', line)
    if m:
        print(m.group())
    # extract execution time in msec
    m = re.search(r'\{executed in (\d+) msec', line)
    if m:
        print(m.group(1))
You can use this as a skeleton to then do whatever data aggregation you need.
My log file looks like this:
2013-05-26 08:22:10,583 DEBUG [jdbc.sqltiming]
com.myclass.servlets.BrowseCategories.categoryList(null:-1)
16. select category0_.id as id, category0_.name as name from categories category0_
{executed in 7 msec}
LogMX is a log viewer tool that can export any log file to CSV, while parsing the date and handling multi-line log events. You can also (in its GUI) compute the time elapsed between several log events.
To do so, you first need to describe (in LogMX) your log format using a Log4j pattern or a regular expression.
P.S. You can also export log files from the command line using this tool (a console mode is provided).