I'm using log4j with the %d ... conversion pattern, which makes every log message begin with a timestamp like so: 2011-06-26 14:34:16,357. I log each SQL query I submit.
I would like to analyze the deltas between SQL queries, and maybe even aggregate multiple executions of the exact same SQL query to get the max time and average time.
How would you approach this? Using grep and some Excel work? Is there a common way, tool or script that would make my life easier?
P.S. to make things more annoying, my SQLs are multi-lines, so log4jdbc sqltiming logger prints them like so:
2011-06-26 14:43:32,112 [SelectCampaignTask ] INFO : jdbc.sqltiming - CREATE INDEX idx ON tab CRLF
USING btree (id1, id2, emf); {executed in 34788 msec}
I would be tempted to write a Groovy/Perl/Python script to pick apart the logs using a regular expression.
If you dump the output to CSV you can certainly use Excel to data mine.
An alternative would be to write the date/time, thread, category, level and log message to a database table. Writing SQL queries against that table is a really easy way of generating custom reports by time range, with filters, etc.
Mining log files seems to be a rite of passage for most developers and is often a good time to learn a scripting language...
I just solved the same problem by writing a small script in Python. I am a complete newbie to Python, and I was able to get it working in less than a couple of hours.
Here are the key parts of my code:
import re

with open("jdbcPerf.log", "r") as f:
    logfile = f.readlines()

# extract the interesting lines: those that start with a date
# and those that contain "{executed in ... msec}"
selectedLines = []
for line in logfile:
    m = re.search(r'^\d+-\d+-\d+|\{executed ', line)
    if m:
        selectedLines.append(line)

# extract name of servlet and execution time
for line in selectedLines:
    # extract servlet name
    m = re.search(r'servlets\.([a-zA-Z]*)\.([a-zA-Z]*)', line)
    if m:
        print(m.group())
    # extract execution time
    m = re.search(r'( \d+ )', line)
    if m:
        print(m.group())
You can use this as a skeleton to then do whatever data aggregation you need.
My log file looks like this:
2013-05-26 08:22:10,583 DEBUG [jdbc.sqltiming]
16. select category0_.id as id, category0_.name as name from categories category0_
{executed in 7 msec}
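To go from that skeleton to the max/average per query asked about in the question, something along these lines might work. It is only a sketch: the names `iter_events` and `aggregate` are made up for illustration, and it assumes every event starts with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp and ends with `{executed in N msec}`, as in the two log samples above.

```python
import re
from collections import defaultdict

# A new log event starts with a timestamp such as "2011-06-26 14:43:32,112".
EVENT_START = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ')
# The sqltiming logger ends each statement with "{executed in N msec}".
EXECUTED = re.compile(r'\{executed in (\d+) msec\}')
# Everything up to and including the logger name (covers both layouts above).
PREFIX = re.compile(r'^.*?jdbc\.sqltiming\]?(?: - )?')


def iter_events(lines):
    """Join physical lines into logical events, one per timestamp."""
    event = []
    for line in lines:
        if EVENT_START.match(line) and event:
            yield ' '.join(event)
            event = []
        event.append(line.strip())
    if event:
        yield ' '.join(event)


def aggregate(lines):
    """Return {sql: (count, max_msec, avg_msec)} for all timed statements."""
    timings = defaultdict(list)
    for event in iter_events(lines):
        m = EXECUTED.search(event)
        if not m:
            continue  # not a sqltiming event
        sql = PREFIX.sub('', event[:m.start()], count=1).strip()
        sql = re.sub(r'^\d+\.\s*', '', sql)  # drop a leading "16. " style number
        sql = ' '.join(sql.split())          # normalise whitespace
        timings[sql].append(int(m.group(1)))
    return {sql: (len(t), max(t), sum(t) / float(len(t)))
            for sql, t in timings.items()}
```

Calling `aggregate(open("jdbcPerf.log"))` would give one entry per distinct SQL text, which can then be sorted by max or average time or dumped to CSV for Excel.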
LogMX is a log viewer tool that can export any log file to CSV, while parsing the date and handling multi-line log events. You can also (in its GUI) compute the time elapsed between several log events.
To do so, you first need to describe (in LogMX) your log format using a Log4j pattern or a regular expression.
PS: you can export log files from command line using this tool (console mode provided).
I was using this working pattern (logback.groovy):
to mask sensitive data. One day I needed to surround it with double quotes, like
was: password=smth
became: "password"="smth"
So I changed the regexp to this (I just added \" before and after the keywords; I also tried \\"):
But I get this error on app startup:
Failed to parse pattern
Unexpected character ('?' (code 63)): was expecting comma to separate Object entries
Can someone please explain to me what I am doing wrong?
If anyone is wondering, here is the correct version:
Currently I am using the Google-provided gcs-text-to-bigquery template and feeding in a transform function to transform my JSONL file. The JSONL is pretty nested, and I want to be able to output multiple rows per line of the newline-delimited JSON by doing some transforms.
For example:
{'state': 'FL', 'metropolitan_counties':[{'name': 'miami dade', 'population':100000}, {'name': 'county2', 'population':100000}…], 'rural_counties':[{'name': 'county1', 'population':100000}, {'name': 'county2', 'population':100000}…], 'total_state_pop':10000000,…}
There will obviously be more counties than 2 and each state will have one of these lines. The output my boss wants is:
When I do the gcs-to-bq text transform, I end up getting only one line per state (so I'll get Miami-Dade county from FL, and then whatever the first county is in my transform for the next state). I read a little bit and I think this is because the mapping in the template expects one output per JSON line. It seems I can do a ParDo (with a DoFn? not sure what that is), or there is a similar option with beam.Map in Python. There is some business logic in the transforms (right now it's about 25 lines of code, as the JSON has more columns than I showed, but those are pretty simple).
Any suggestions on this? Data is coming in tonight/tomorrow, and there will be hundreds of thousands of rows in a BQ table.
The template I am using is currently in Java, but I can translate it to Python pretty easily as there are a lot of examples online in Python. I know Python better, and I think it's easier given the different types (sometimes a field can be null), and it seems less daunting given that the examples I saw look simpler. However, I am open to either.
Solving that in Python is somewhat straightforward. Here's one possibility (not fully tested):
from __future__ import absolute_import

import ast
import os

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service_account.json'

pipeline_args = [
    # pipeline flags (runner, project, temp location, ...) go here
]
pipeline_options = PipelineOptions(pipeline_args)


def jsonify(element):
    # the input lines use single quotes, so parse them as Python literals
    return ast.literal_eval(element)


def unnest(element):
    # emit one output row per county instead of one row per state
    state = element.get('state')
    state_pop = element.get('total_state_pop')
    if state is None or state_pop is None:
        return  # skip states with missing key fields
    for type_ in ['metropolitan_counties', 'rural_counties']:
        for e in element.get(type_, []):
            name = e.get('name')
            pop = e.get('population')
            county_type = (
                'Metropolitan' if type_ == 'metropolitan_counties' else 'Rural'
            )
            if name is None or pop is None:
                continue  # skip incomplete county entries
            yield {
                'State': state,
                'County_Type': county_type,
                'County_Name': name,
                'County_Pop': pop,
                'State_Pop': state_pop
            }


with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromText('gs://url to file')
    schema = 'State:STRING,County_Type:STRING,County_Name:STRING,County_Pop:INTEGER,State_Pop:INTEGER'
    data = (
        lines
        | 'Jsonify' >> beam.Map(jsonify)
        | 'Unnest' >> beam.FlatMap(unnest)
        | 'Write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
            'project_id:dataset_id.table_name', schema=schema))
    )
This will only succeed if you are working with batch data. If you have streaming data, just change beam.io.Write(beam.io.BigQuerySink(...)) to beam.io.WriteToBigQuery.
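On the "one line per state" symptom from the question: a Map-style step emits exactly one element per input, while a FlatMap-style step emits every item the function yields as its own element. The following is a plain-Python illustration of that difference, not Beam code; `map_like`, `flat_map_like` and `unnest_counties` are hypothetical helpers written only for this sketch.

```python
def unnest_counties(record):
    """Yield one flat row per county found in a state record."""
    for key, label in (('metropolitan_counties', 'Metropolitan'),
                       ('rural_counties', 'Rural')):
        for county in record.get(key, []):
            yield {'State': record['state'],
                   'County_Type': label,
                   'County_Name': county['name'],
                   'County_Pop': county['population']}


def map_like(fn, elements):
    """Emulates a Map step: exactly one output element per input element."""
    return [fn(e) for e in elements]


def flat_map_like(fn, elements):
    """Emulates a FlatMap step: each yielded item becomes its own element."""
    return [out for e in elements for out in fn(e)]


records = [{
    'state': 'FL',
    'metropolitan_counties': [{'name': 'miami dade', 'population': 100000},
                              {'name': 'county2', 'population': 100000}],
    'rural_counties': [{'name': 'county1', 'population': 100000}],
}]

one_per_state = map_like(lambda r: list(unnest_counties(r)), records)
one_per_county = flat_map_like(unnest_counties, records)
```

Here `one_per_state` has a single element (a list of three rows), while `one_per_county` has three separate rows, which is the shape a BigQuery table write needs.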
Here is what I am doing:
I am using the Mule MS Dynamics connector to create a contact.
I get records from a MySQL database (inserted from a source file).
I transform them to a CRM-specific object in DataWeave.
This works for over 10 million records, but for a few hundred records I am getting the following error:
Problem writing SAAJ model to stream: Invalid white space character (0x1f) in text to output (in xml 1.1, could output as a character entity)
After some research I found out that 0x1f represents US, the "Unit Separator" control character.
I tried replacing this character in my DataWeave like this:
%var replaceSaaj = (x) -> (x replace /\"0x1f"/ with "" default "")
but the issue persists.
I even tried to look for these characters in my source file and database with no luck.
I am aware that this connector internally uses SOAP services.
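One thing that may be worth checking: the pattern above appears to look for the literal text 0x1f (in quotes) rather than for the single control character U+001F, and the control character is invisible in most editors, which could explain why searching the source file found nothing. Below is a sketch of the distinction in Python (not DataWeave). The helper name `strip_invalid_xml_chars` and the sample value are made up for illustration; the character class lists the C0 control characters that XML 1.0 does not allow in text.

```python
import re

# Characters XML 1.0 does not allow in text: all C0 control characters
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
INVALID_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_invalid_xml_chars(text):
    """Remove control characters that cannot be written to XML 1.0."""
    return INVALID_XML_CHARS.sub('', text)


# Hypothetical field value containing a Unit Separator between two words.
value = 'John\x1fSmith'

contains_literal_text = '0x1f' in value   # the four visible characters "0x1f"
contains_control_char = '\x1f' in value   # the single character U+001F
cleaned = strip_invalid_xml_chars(value)
```

In this sample `contains_literal_text` is False and `contains_control_char` is True, so a replace that targets the text "0x1f" leaves the value unchanged.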
I'm using PySpark and spark-submit in order to read and manipulate CSV files with headers.
First operations are related to truncating some columns, casting to integer types, etc.
The main operation is using groupBy in order to calculate statistical measures of a column, based on another column value.
When I run my script on a 1 GB file, it works perfectly!
The problem is that when I run it on a 20 GB file it fails, as far as I can understand because of errors in groupBy.
Both files have the same format and exact same columns, e.g.:
www.google.com 20170113093210 20170113093210 150 1 ... ...
www.cnet.com 20170113114510 20170113093210 150 2 ... ...
The only difference is that the first file contains X transactions, while the second contains a hell of a lot more (20 GB of records).
ERROR Log: (Errors start in line 32)
pastebin link for error log
My Script:
import datetime
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import mean, stddev, regexp_replace, col

sc = SparkContext('local[*]')
sqlContext = SQLContext(sc)

print ('** Script Started: %s **' % str(datetime.datetime.now()))  # Analysis Start Time

print "Loading file... ",
log_df = sqlContext.read.format('csv').\
    options(header='true', inferschema='true', delimiter='\t', dateFormat='yyyyMMddHHmmss').\
    load("hdfs:/user/BGU/logs/01_transactions.log")  # Load data file
print "Done!\nAdjusting data to fit our needs... ",

# Manipulate columns to fit our needs
# (url_col, the name of the URL column, must also be set here)
size_col = 'DOWNSTREAM_SIZE'
flag_col = 'CONGESTION_FLAG'
log_df = log_df.filter(~log_df[url_col].rlike("(SNI.*)")).\
    withColumn(flag_col, regexp_replace(col(flag_col), "(;.*)", ""))
log_df = log_df.withColumn(size_col, log_df[size_col].cast(IntegerType()))
print "done!\n\n** %s Statistical Measures **\n" % size_col

# Statistics of size_col, grouped by CONGESTION_FLAG value
log_df.cache().groupBy(flag_col).agg(mean(size_col).alias("Mean"), stddev(size_col).alias("Stddev")).\
    withColumn("Variance", pow(col("Stddev"), 2)).show(3, False)

print ('** Script Ended: %s **' % str(datetime.datetime.now()))  # Analysis End Time
If any more info is needed please tell me and I'll provide it.
The cause of the errors was some 'bad' records, I guess.
After adding mode='DROPMALFORMED' to the CSV parsing options, the problem was resolved and the script completed without errors.
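For reference, here is roughly how the reader options from the script look with that mode added. This is a sketch only: `csv_options` and `load_log` are names introduced here for illustration, and `sqlContext` is the object created in the script above.

```python
# CSV reader options from the script above, plus the parse mode.
csv_options = {
    'header': 'true',
    'inferschema': 'true',
    'delimiter': '\t',
    'dateFormat': 'yyyyMMddHHmmss',
    'mode': 'DROPMALFORMED',  # skip rows that cannot be parsed
}


def load_log(sqlContext, path):
    """Load the tab-separated log, dropping malformed rows."""
    return sqlContext.read.format('csv').options(**csv_options).load(path)
```

Usage would be `log_df = load_log(sqlContext, "hdfs:/user/BGU/logs/01_transactions.log")`. Note that dropped rows disappear silently, so it may be worth comparing row counts with and without the mode to see how many records were affected.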
I'm running into some issues developing a custom function query using Solr 3.6.2.
My goal is to be able to implement a custom sorting technique.
I have a field called daily_prices_str; it is a single-valued str.
<str name="daily_prices_str">
2014-05-01:130 2014-05-02:130 2014-05-03:130 2014-05-04:130 2014-05-05:130 2014-05-06:130 2014-05-07:130 2014-05-08:130 2014-05-09:130 2014-05-10:130 2014-05-11:130 2014-05-12:130 2014-05-13:130 2014-05-14:130 2014-05-15:130 2014-05-16:130 2014-05-17:130 2014-05-18:130 2014-05-19:130 2014-05-20:130 2014-05-21:130 2014-05-22:130 2014-05-23:130 2014-05-24:130 2014-05-25:130 2014-05-26:130 2014-05-27:130 2014-05-28:130 2014-05-29:130 2014-05-30:130 2014-05-31:130 2014-06-01:130 2014-06-02:130 2014-06-03:130 2014-06-04:130 2014-06-05:130 2014-06-06:130 2014-06-07:130 2014-06-08:130 2014-06-09:130 2014-06-10:130 2014-06-11:130 2014-06-12:130 2014-06-13:130 2014-06-14:130 2014-06-15:130 2014-06-16:130 2014-06-17:130 2014-06-18:130 2014-06-19:130 2014-06-20:130 2014-06-21:130 2014-06-22:130 2014-06-23:130 2014-06-24:130 2014-06-25:130 2014-06-26:130 2014-06-27:130 2014-06-28:130 2014-06-29:130 2014-06-30:130 2014-07-01:130 2014-07-02:130 2014-07-03:130 2014-07-04:130 2014-07-05:130 2014-07-06:130 2014-07-07:130 2014-07-08:130 2014-07-09:130 2014-07-10:130 2014-07-11:130 2014-07-12:130 2014-07-13:130 2014-07-14:130 2014-07-15:130 2014-07-16:130 2014-07-17:130 2014-07-18:130 2014-07-19:170 2014-07-20:170 2014-07-21:170 2014-07-22:170 2014-07-23:170 2014-07-24:170 2014-07-25:170 2014-07-26:170 2014-07-27:170 2014-07-28:170 2014-07-29:170 2014-07-30:170 2014-07-31:170 2014-08-01:170 2014-08-02:170 2014-08-03:170 2014-08-04:170 2014-08-05:170 2014-08-06:170 2014-08-07:170 2014-08-08:170 2014-08-09:170 2014-08-10:170 2014-08-11:170 2014-08-12:170 2014-08-13:170 2014-08-14:170 2014-08-15:170 2014-08-16:170 2014-08-17:170 2014-08-18:170 2014-08-19:170 2014-08-20:170 2014-08-21:170 2014-08-22:170 2014-08-23:170 2014-08-24:170 2014-08-25:170 2014-08-26:170 2014-08-27:170 2014-08-28:170 2014-08-29:170 2014-08-30:170
As you can see the structure of the string is date:price.
Basically, I would like to parse the string to get the price for a particular period and sort by that price.
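To make the intent concrete, here is a small Python sketch of the parsing step. It is not the Java plugin itself: the function names are made up, and it assumes that the "price for a period" means the sum of the daily prices between two dates, inclusive.

```python
from datetime import date


def parse_daily_prices(value):
    """Turn "2014-05-01:130 2014-05-02:130 ..." into {date: price}."""
    prices = {}
    for token in value.split():
        day, _, price = token.partition(':')
        year, month, dom = (int(part) for part in day.split('-'))
        prices[date(year, month, dom)] = int(price)
    return prices


def price_for_period(value, start, end):
    """Sum of daily prices for start <= day <= end (assumed semantics)."""
    prices = parse_daily_prices(value)
    return sum(p for day, p in prices.items() if start <= day <= end)


# A short excerpt in the same date:price format as the field above.
sample = "2014-07-17:130 2014-07-18:130 2014-07-19:170 2014-07-20:170"
total = price_for_period(sample, date(2014, 7, 18), date(2014, 7, 19))
```

For the excerpt above, `total` covers 2014-07-18 (130) and 2014-07-19 (170). Documents could then be ordered by this value, which is what the function query is meant to do inside Solr.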
I’ve already developed the Java plugin for the custom function query, and I’m at the point where my code compiles, runs, executes, etc. Solr is happy with my code.
If I run this query I can see the correct price in the score field:
One of the problems is that I cannot sort by function result.
If I run this query:
I get a 404 saying that "sort param could not be parsed as a query, and is not a field that exists in the index: $price"
But it works with a workaround:
The main problem is that I cannot filter by range:
/select?price=sum(0,price(daily_prices_str,2015-1-1,2015-1-3))&q={!frange l=100 u=400}$price
Maybe I'm going about this totally incorrectly?
Instead of passing the newly created "price" to the "sort" parameter, can you pass the function with data itself like so?
q=*:*&sort=price(daily_prices_str,2015-01-01,2015-01-03) ...