Providing static data from outside a web application - Java

I have two machines, each running an application server.
Machine X serves dynamic content; machine Y serves static content.
Thus, the user is always connected to "x.com".
When he uploads an image, I need to send it to "y.com".
How can I pass the image bytes (at upload time) from server x.com so they are saved on y.com?
See here what I started doing:
http://forum.primefaces.org/viewtopic.php?f=3&t=30239&p=96776#p96776
BalusC answered a similar question very well here:
Simplest way to serve static data from outside the application server in a Java web application
But my case is slightly different.
I appreciate any help!
Thank you!

I think the simplest way is to create a database table on X.com to track all the images your users store on Y.com, for example:
+---------+-------------------------+
| user_id | image_path              |
+---------+-------------------------+
|       0 | /images/image_xxxxx.jpg |
|       0 | /images/image_xxxxx.jpg |
|       2 | /images/image_xxxxx.jpg |
|       2 | /images/image_xxxxx.jpg |
|       3 | /images/image_xxxxx.jpg |
+---------+-------------------------+
and then serve all your images on X.com by redirecting the browser to Y.com.
On X.com:
<img src="http://y.com/images/image_xxxxx.jpg" />
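As a rough illustration of that bookkeeping on X.com (a sketch only: the images table, its columns, and the availability of a DataSource are assumptions based on the example above, and it assumes the file has already been stored on Y.com under /images):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class ImageTracker {

    private final DataSource ds;

    public ImageTracker(DataSource ds) {
        this.ds = ds;
    }

    // Record that a user's image now lives on Y.com under imagePath.
    public void track(long userId, String imagePath) throws SQLException {
        try (Connection con = ds.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO images (user_id, image_path) VALUES (?, ?)")) {
            ps.setLong(1, userId);
            ps.setString(2, imagePath);          // e.g. /images/image_xxxxx.jpg
            ps.executeUpdate();
        }
    }

    // When rendering the page on X.com, point the browser at Y.com for the bytes.
    public String toImgSrc(String imagePath) {
        return "http://y.com" + imagePath;       // used as <img src="...">
    }
}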

Use a shared network disk, such as Samba or NFS.
Optionally, you can consider setting up rsync if you have Linux/U*x hosts.
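If you go the shared-disk route, the upload handler on x.com only has to write the received bytes into the mounted directory. A minimal sketch, assuming y.com's image folder is mounted on x.com at /mnt/y-static/images (the mount point is an assumption):
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SharedDiskUpload {

    // Directory on x.com that is an NFS/Samba mount of y.com's static image folder (assumed path).
    private static final Path SHARED_IMAGES = Paths.get("/mnt/y-static/images");

    // Writes the uploaded stream to the shared mount; y.com then serves it as /images/<fileName>.
    public static Path save(InputStream uploadedBytes, String fileName) throws IOException {
        Path target = SHARED_IMAGES.resolve(fileName);
        Files.copy(uploadedBytes, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}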


How to use nested Scenario Outline in Cucumber java

Suppose I have a Scenario Outline like
#Scenario1
Scenario Outline: Scenario one
Given fill up login fields "<email>" and "<password>"
And click the login button
Examples:
| email             | password      |
| someEmailAddress  | SomePassword  |
| someEmailAddress2 | SomePassword2 |
and another Scenario like
#Scenario2
Scenario Outline: Scenario two
Given fill up fields "<value1>" and "<value2>"
Examples:
| value1  | value2  |
| value11 | value21 |
| value12 | value22 |
How could I run a scenario that logs in with 'someEmailAddress', fills up with all the Scenario 2 values, then logs in with 'someEmailAddress2' and does the same?
Cucumber scenarios are tools we use to describe behaviour, i.e. what is happening and why it's important. They are not tools to program tests. The way to use Cucumber effectively is to keep your scenarios simple, and let the code called by your step definitions do the programming for you.
Step definitions and the methods they call are written in a programming language. This gives you all the power you need to deal with the details of how you interact with your system.
The art of writing Cucumber scenarios is for each one to talk about:
The state we need set up so we can do something (Givens)
Our interaction (When)
What we expect to see after our interaction (Then)
So for your scenario we have
Scenario: Login
Given I am registered
When I login
Then I should be logged in
When we make this scenario work, our program has the behaviour that we can log in. We can then use that behaviour in other scenarios, e.g.
Scenario: See my profile
Given I am logged in
When I view my profile
Then I should see my profile
Now to make this work we might need a bit more work, because this scenario doesn't have a registered user yet. We can deal with this in a number of ways:
1) Add another Given, perhaps in a background
Background:
Given I am registered
Scenario ...
Given I am logged in
2) We can register in the login step, e.g.
Given "I am logged in" do
  @i = register_user
  login_as user: @i
end
Notice how in this step we are calling helper methods register_user and login_as to do the work for us.
This is the way to start using Cucumber. Notice how my scenarios have no mention of how we login, no email, no password, no filling in anything. To use Cucumber effectively you have to push these details down into the step definitions and the helper methods they call.
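Since the question is tagged cucumber-java, the same idea in Java looks roughly like this; the annotation import is from cucumber-java, while registerUser/loginAs and the User type are hypothetical helpers you would write against your own application:
import io.cucumber.java.en.Given;

public class AuthSteps {

    private User user;                       // hypothetical type returned by your helpers

    @Given("I am registered")
    public void iAmRegistered() {
        user = TestHelpers.registerUser();   // hypothetical helper
    }

    @Given("I am logged in")
    public void iAmLoggedIn() {
        user = TestHelpers.registerUser();   // register inside the login step (option 2 above)
        TestHelpers.loginAs(user);           // hypothetical helper
    }
}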
Summary
Keep your scenarios simple and use them to describe WHAT and explain WHY. Use the step definitions and helper methods to deal with HOW. There is no need to use Scenario Outlines when using Cucumber, and you should never be nesting them.
There is no support for nested Scenario Outlines in Cucumber, but you can use the following approach to work around it:
Scenario Outline: Scenario one and two
Given fill up login fields "<email>" and "<password>"
And click the login button
And fill up fields "<value1>" and "<value2>"
Examples:
| email             | password      | value1  | value2  |
| someEmailAddress  | SomePassword  | value11 | value21 |
| someEmailAddress  | SomePassword  | value12 | value22 |
| someEmailAddress2 | SomePassword2 | value11 | value21 |
| someEmailAddress2 | SomePassword2 | value12 | value22 |
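For completeness, a sketch of Java step definitions that would bind the combined outline above (annotations from cucumber-java; LoginPage and FormPage are hypothetical page/helper objects):
import io.cucumber.java.en.Given;

public class CombinedOutlineSteps {

    private final LoginPage loginPage = new LoginPage();   // hypothetical helper objects
    private final FormPage formPage = new FormPage();

    @Given("fill up login fields {string} and {string}")
    public void fillUpLoginFields(String email, String password) {
        loginPage.enterCredentials(email, password);
    }

    @Given("click the login button")
    public void clickTheLoginButton() {
        loginPage.submit();
    }

    @Given("fill up fields {string} and {string}")
    public void fillUpFields(String value1, String value2) {
        formPage.fill(value1, value2);
    }
}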

How to specify uberization of a Hive query in Hadoop2?

There is a new feature in Hadoop 2 called uberization. For example, this reference says:
Uberization is the possibility to run all tasks of a MapReduce job in
the ApplicationMaster's JVM if the job is small enough. This way, you
avoid the overhead of requesting containers from the ResourceManager
and asking the NodeManagers to start (supposedly small) tasks.
What I can't tell is whether this just happens magically behind the scenes or whether one needs to do something to make it happen. For example, when running a Hive query, is there a setting (or hint) to get this to happen? Can you specify the threshold for what is "small enough"?
Also, I'm having trouble finding much about this concept - does it go by another name?
I found details in the YARN Book by Arun Murthy about "uber jobs":
An Uber Job occurs when multiple mapper and reducers are combined to use a single
container. There are four core settings around the configuration of Uber Jobs found in
the mapred-site.xml options presented in Table 9.3.
Here is table 9.3:
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.enable | Whether to enable the small-jobs "ubertask" optimization, |
| | which runs "sufficiently small" jobs sequentially within a |
| | single JVM. "Small" is defined by the maxmaps, maxreduces, |
| | and maxbytes settings. Users may override this value. |
| | Default = false. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxmaps | Threshold for the number of maps beyond which the job is |
| | considered too big for the ubertasking optimization. |
| | Users may override this value, but only downward. |
| | Default = 9. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxreduces | Threshold for the number of reduces beyond which |
| | the job is considered too big for the ubertasking |
| | optimization. Currently the code cannot support more |
| | than one reduce and will ignore larger values. (Zero is |
| | a valid maximum, however.) Users may override this |
| | value, but only downward. |
| | Default = 1. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxbytes | Threshold for the number of input bytes beyond |
| | which the job is considered too big for the uber- |
| | tasking optimization. If no value is specified, |
| | `dfs.block.size` is used as a default. Be sure to |
| | specify a default value in `mapred-site.xml` if the |
| | underlying file system is not HDFS. Users may override |
| | this value, but only downward. |
| | Default = HDFS block size. |
|-----------------------------------+------------------------------------------------------------|
I don't know yet if there is a Hive-specific way to set this or if you just use the above with Hive.
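If you want to experiment from Hive, one thing to try (unverified here) is setting these properties per session, since Hive's SET command passes Hadoop/MapReduce properties through to the job configuration it submits:
SET mapreduce.job.ubertask.enable=true;
SET mapreduce.job.ubertask.maxmaps=9;
SET mapreduce.job.ubertask.maxreduces=1;
SET mapreduce.job.ubertask.maxbytes=134217728;  -- example threshold (128 MB); defaults to the HDFS block size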
An uber job occurs when multiple mappers and reducers are combined so that they execute inside the ApplicationMaster. So, assuming the job to be executed has at most 9 mappers and at most 1 reducer, the ResourceManager (RM) launches an ApplicationMaster, which then runs the job entirely within its own JVM.
SET mapreduce.job.ubertask.enable=TRUE;
So the advantage of an uberized job is that the round-trip overhead of the ApplicationMaster requesting containers from the ResourceManager (RM), and the RM allocating those containers back to the ApplicationMaster, is eliminated.

Lucene 4.x performance issues

Over the last few weeks I've been working on upgrading an application from Lucene 3.x to Lucene 4.x in hopes of improving performance. Unfortunately, after going through the full migration process and playing with all sorts of tweaks I found online and in the documentation, Lucene 4 is running significantly slower than Lucene 3 (~50%). I'm pretty much out of ideas at this point, and was wondering if anyone else had any suggestions on how to bring it up to speed. I'm not even looking for a big improvement over 3.x anymore; I'd be happy to just match it and stay on a current release of Lucene.
<Edit>
In order to confirm that none of the standard migration changes had a negative effect on performance, I ported my Lucene 4.x version back to Lucene 3.6.2 and kept the newer API rather than using the custom ParallelMultiSearcher and other deprecated methods/classes.
Performance in 3.6.2 is even faster than before:
Old application (Lucene 3.6.0) - ~5700 requests/min
Updated application with new API and some minor optimizations (Lucene 4.4.0) - ~2900 requests/min
New version of the application ported back, but retaining optimizations and newer IndexSearcher/etc API (Lucene 3.6.2) - ~6200 requests/min
Since the optimizations and use of the newer Lucene API actually improved performance on 3.6.2, it doesn't make sense for this to be a problem with anything but Lucene. I just don't know what else I can change in my program to fix it.
</Edit>
Application Information
We have one index that is broken into 20 shards - this provided the best performance in both Lucene 3.x and Lucene 4.x
The index currently contains ~150 million documents, all of which are fairly simple and heavily normalized so there are a lot of duplicate tokens. Only one field (an ID) is stored - the others are not retrievable.
We have a fixed set of relatively simple queries that are populated with user input and executed - they are comprised of multiple BooleanQueries, TermQueries and TermRangeQueries. Some of them are nested, but only a single level right now.
We're not doing anything advanced with results - we just fetch the scores and the stored ID fields
We're using MMapDirectories pointing to index files in a tmpfs. We played with the useUnmap "hack" since we don't open new Directories very often and got a nice boost from that
We're using a single IndexSearcher for all queries
Our test machines have 94GB of RAM and 64 logical cores
General Processing
1) Request received by socket listener
2) Up to 4 Query objects are generated and populated with normalized user input (all of the required input for a query must be present or it won't be executed)
3) Queries are executed in parallel using the Fork/Join framework
Subqueries to each shard are executed in parallel using the IndexSearcher w/ExecutorService
4) Aggregation and other simple post-processing
Other Relevant Info
Indexes were recreated for the 4.x system, but the data is the same. We tried the normal Lucene42 codec as well as an extended one that didn't use compression (per a suggestion on the web)
In 3.x we used a modified version of the ParallelMultiSearcher; in 4.x we're using the IndexSearcher with an ExecutorService and combining all of our readers in a MultiReader
In 3.x we used a ThreadPoolExecutor instead of Fork/Join (Fork/Join performed better in my tests)
4.x Hot Spots
Method | Self Time (%) | Self Time (ms)| Self Time (CPU in ms)
java.util.concurrent.CountDownLatch.await() | 11.29% | 140887.219 | 0.0 <- this is just from tcp threads waiting for the real work to finish - you can ignore it
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>() | 9.74% | 121594.03 | 121594
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>() | 9.59% | 119680.956 | 119680
org.apache.lucene.codecs.lucene41.ForUtil.readBlock() | 6.91% | 86208.621 | 86208
org.apache.lucene.search.DisjunctionScorer.heapAdjust() | 6.68% | 83332.525 | 83332
java.util.concurrent.ExecutorCompletionService.take() | 5.29% | 66081.499 | 6153
org.apache.lucene.search.DisjunctionSumScorer.afterNext() | 4.93% | 61560.872 | 61560
org.apache.lucene.search.TermScorer.advance() | 4.53% | 56530.752 | 56530
java.nio.DirectByteBuffer.get() | 3.96% | 49470.349 | 49470
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>() | 2.97% | 37051.644 | 37051
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getFrame() | 2.77% | 34576.54 | 34576
org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo() | 2.47% | 30767.711 | 30767
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.newTermState() | 2.23% | 27782.522 | 27782
java.net.ServerSocket.accept() | 2.19% | 27380.696 | 0.0
org.apache.lucene.search.DisjunctionSumScorer.advance() | 1.82% | 22775.325 | 22775
org.apache.lucene.search.HitQueue.getSentinelObject() | 1.59% | 19869.871 | 19869
org.apache.lucene.store.ByteBufferIndexInput.buildSlice() | 1.43% | 17861.148 | 17861
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getArc() | 1.35% | 16813.927 | 16813
org.apache.lucene.search.DisjunctionSumScorer.countMatches() | 1.25% | 15603.283 | 15603
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() | 1.12% | 13929.646 | 13929
java.util.concurrent.locks.ReentrantLock.lock() | 1.05% | 13145.631 | 8618
org.apache.lucene.util.PriorityQueue.downHeap() | 1.00% | 12513.406 | 12513
java.util.TreeMap.get() | 0.89% | 11070.192 | 11070
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs() | 0.80% | 10026.117 | 10026
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() | 0.62% | 7746.05 | 7746
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator() | 0.60% | 7482.395 | 7482
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact() | 0.55% | 6863.069 | 6863
org.apache.lucene.store.DataInput.clone() | 0.54% | 6721.357 | 6721
java.nio.DirectByteBufferR.duplicate() | 0.48% | 5930.226 | 5930
org.apache.lucene.util.fst.ByteSequenceOutputs.read() | 0.46% | 5708.354 | 5708
org.apache.lucene.util.fst.FST.findTargetArc() | 0.45% | 5601.63 | 5601
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock() | 0.45% | 5567.914 | 5567
org.apache.lucene.store.ByteBufferIndexInput.toString() | 0.39% | 4889.302 | 4889
org.apache.lucene.codecs.lucene41.Lucene41SkipReader.<init>() | 0.33% | 4147.285 | 4147
org.apache.lucene.search.TermQuery$TermWeight.scorer() | 0.32% | 4045.912 | 4045
org.apache.lucene.codecs.MultiLevelSkipListReader.<init>() | 0.31% | 3890.399 | 3890
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() | 0.31% | 3886.194 | 3886
If there's any other information you could use that might help, please let me know.
For anyone who cares or is trying to do something similar (controlled parallelism within a query), the problem I had was that the IndexSearcher was creating a task per segment per shard rather than a task per shard - I misread the javadoc.
I resolved the problem by using forceMerge(1) on my shards to limit the number of extra threads. In my use case this isn't a big deal since I don't currently use NRT search, but it still adds unnecessary complexity to the update + slave synchronization process, so I'm looking into ways to avoid the forceMerge.
As a quick fix, I'll probably just extend the IndexSearcher and have it spawn a thread per reader instead of a thread per segment, but the idea of a "virtual segment" was brought up in the Lucene mailing list. That would be a much better long-term fix.
If you want to see more info, you can follow the lucene mailing list thread here:
http://www.mail-archive.com/java-user@lucene.apache.org/msg42961.html
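For anyone wanting the shape of that workaround in code, here is a rough Lucene 4.4-style sketch (paths, analyzer choice and thread-pool size are placeholders, not my exact setup): a one-off forceMerge(1) per shard, then a single MultiReader and IndexSearcher backed by an ExecutorService.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.Version;

public class ShardSearcherSetup {

    // One-off maintenance step: collapse a shard to a single segment so the
    // searcher's executor ends up with one task per shard instead of one per segment.
    public static void forceMergeShard(File shardDir) throws IOException {
        IndexWriterConfig iwc =
                new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
        try (IndexWriter writer = new IndexWriter(new MMapDirectory(shardDir), iwc)) {
            writer.forceMerge(1);
        }
    }

    // Combine all shard readers into one MultiReader and search them in parallel.
    public static IndexSearcher openSearcher(List<File> shardDirs, ExecutorService pool) throws IOException {
        List<IndexReader> readers = new ArrayList<>();
        for (File dir : shardDirs) {
            readers.add(DirectoryReader.open(new MMapDirectory(dir)));
        }
        return new IndexSearcher(new MultiReader(readers.toArray(new IndexReader[0])), pool);
    }
}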

Java tabular output to a file

I'm interested in a way of outputting some objects in a table-like layout. A concrete example would be something like:
--------------------------------
| Name     | foo     | bar     |
--------------------------------
| asdas    | dsfsd   | 1233.23 |
| adasdasd | fsdfs   |    3.23 |
| sdasjd   | knsdfsd |   13.23 |
| lkkkj    | dsfsd   | 2343.23 |
--------------------------------
Or an MS Office / OpenOffice spreadsheet file. (Is there API documentation for this kind of data output in specific editors, e.g. how to define a table in OpenOffice?)
I'm asking this because I would like to know the best way of doing this.
PS: there is no need to deserialise.
docx4j is a library for creating and manipulating .docx, .pptx and .xlsx files.
If you do not feel like using docx4j, or it does not fit your needs, you can try these:
Apache POI
OpenOffice API
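For example, a minimal Apache POI sketch that writes such objects to an .xlsx file could look like this (the sheet name, file name and String[][] input are arbitrary placeholders):
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxExport {

    // Writes each String[] as one row of the spreadsheet.
    public static void write(String[][] rows, String file) throws IOException {
        try (Workbook wb = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream(file)) {
            Sheet sheet = wb.createSheet("Objects");
            for (int r = 0; r < rows.length; r++) {
                Row row = sheet.createRow(r);
                for (int c = 0; c < rows[r].length; c++) {
                    row.createCell(c).setCellValue(rows[r][c]);
                }
            }
            wb.write(out);
        }
    }
}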
The easiest approach is to export comma-separated values (CSV), which you can open in Excel.
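A minimal sketch of that approach, assuming each object has already been turned into a String[] of column values (and ignoring escaping of commas/quotes):
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CsvExport {

    // Writes a header plus one line per object; Excel and OpenOffice open .csv files directly.
    public static void write(List<String[]> rows, String file) throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(file)))) {
            out.println("Name,foo,bar");
            for (String[] row : rows) {
                out.println(String.join(",", row));
            }
        }
    }
}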
You can use the data-exporter library.

Looking for database design advice

I am working on a log analyzer system, which reads Tomcat logs and displays them as a chart/table on a web page.
(I know there are existing log analyzer systems; I am reinventing the wheel. But this is my job; my boss wants it.)
Our Tomcat logs are saved by day. For example:
2011-01-01.txt
2011-01-02.txt
......
The following is my approach for exporting logs to the DB and reading them:
1 The DB structure
I have three tables:
1) log_current: saves the logs generated today.
2) log_past: saves the logs generated before today.
The above two tables share the SAME schema.
+-------+-----------+----------+----------+--------+-----+----------+----------+--------+---------------------+---------+----------+-------+
| Id | hostip | username | datasend | method | uri | queryStr | protocol | status | time | browser | platform | refer |
+-------+-----------+----------+----------+--------+-----+----------+----------+--------+---------------------+---------+----------+-------+
| 44359 | 127.0.0.1 | - | 0 | GET | / | | HTTP/1.1 | 404 | 2011-02-17 08:08:25 | Unknown | Unknown | - |
+-------+-----------+----------+----------+--------+-----+----------+----------+--------+---------------------+---------+----------+-------+
3) log_record: saves bookkeeping information for log_past; it records the days whose logs have been exported to the log_past table.
+-----+------------+
| Id | savedDate |
+-----+------------+
| 127 | 2011-02-15 |
| 128 | 2011-02-14 |
..................
+-----+------------+
The table shows that the logs of 2011-02-15 have been exported.
2 Export (to DB)
I have two scheduled jobs.
1) The daily job
At 00:05:00, check the Tomcat log directory (/tomcat/logs) to find the log files of the latest 30 days (which of course include yesterday's logs).
Check the log_record table to see whether the logs of each day have been exported; for example, if 2011-02-16 is not found in log_record, I read 2011-02-16.txt and export it to log_past.
After exporting yesterday's logs, I start the file monitor for today's log (2011-02-17.txt), whether it exists yet or not.
2) The file monitor
Once the monitor is started, it reads the file hour by hour. Each log entry it reads is saved in the log_current table.
3 Tomcat server restart
Sometimes we have to restart Tomcat, so once Tomcat is started, I delete all rows from log_current and then run the daily job.
4 My problem
1) Two tables (log_current and log_past).
If I saved today's logs straight to log_past, I could not be sure that every log file (xxxx-xx-xx.txt) had been exported to the DB; the check at 00:05:00 every day only guarantees that logs from before today have been exported.
But this split makes it difficult to query logs across yesterday and today.
For example, a query from 2011-02-14 00:00:00 to 2011-02-15 00:00:00 only touches log_past.
But what about from 2011-02-14 00:00:00 to 2011-02-17 08:00:00 (suppose it is 2011-02-17 09:00:00 now)?
It is complex to query across tables.
Also, I keep thinking my design for the tables and the workflow (the scheduled export/read jobs) is not ideal, so can anyone give a good suggestion?
I just need to export and read logs and do near-real-time analysis, where "real-time" means I have to make the current day's logs visible in the chart/table, etc.
First of all, IMO you don't need two different tables, log_current and log_past. You can insert all the rows into the same table, say logs, and retrieve a given day with (assuming each row in logs also stores the id of its day's log_record entry):
select * from logs where log_record_id = (select id from log_record where savedDate = 'YOUR_DATE')
This will give you all the logs of that particular day.
Now, once you remove the current/past distinction between the tables this way, I think the problem you are asking about here is solved. :)
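And since the single logs table keeps the time column from the schema shown in the question, the cross-day case becomes one plain range query, for example:
select * from logs
where time >= '2011-02-14 00:00:00'
  and time <  '2011-02-17 08:00:00'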
