I'm trying to generate a CSV file based on a list of objects returned by a web service method.
The problem is that I want to retrieve all of the available objects, but the call fails if I try to get more than 100 entries (the method has two parameters that let me specify the interval of objects to retrieve, e.g. from 10 to 50, from 45 to 120, etc.).
I thought of making sequential calls while incrementing the two indexes that represent the interval, but someone suggested that I should use batch processing for this. As far as I have searched the internet, I have only found examples of exporting database data or XML files to CSV using Spring Batch.
Could someone explain how I should handle this situation, or at least point me to an example/tutorial similar to what I need? Thank you very much!
If you try to load all the data in a single web service request, you risk a memory or timeout exception because the response will be too large. Instead, try making several calls to the web service, like a paginated request, and after each response insert the results into your local database.
When all the calls are finished, run a process that builds your CSV file from what you stored, something like the sketch below.
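A minimal sketch, assuming a hypothetical MyServiceClient with a getItems(from, to) method capped at 100 entries per call (for brevity it writes each page straight to the CSV instead of staging it in a local database first):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.List;

public class PaginatedExport {
    private static final int PAGE_SIZE = 100;

    public static void main(String[] args) throws Exception {
        // MyServiceClient and Item are hypothetical stand-ins for the generated web service stubs.
        MyServiceClient client = new MyServiceClient();
        try (PrintWriter csv = new PrintWriter(new FileWriter("export.csv"))) {
            csv.println("id,name");                                  // header row
            int from = 0;
            while (true) {
                // One bounded call per iteration, never more than 100 entries.
                List<Item> page = client.getItems(from, from + PAGE_SIZE);
                if (page.isEmpty()) {
                    break;                                           // no more data available
                }
                for (Item item : page) {
                    csv.println(item.getId() + "," + item.getName());
                }
                from += PAGE_SIZE;
            }
        }
    }
}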
Regards.
Related
A small question regarding Spark: how can I read from the result of an HTTP response?
It is well known that Spark can take a database, a CSV file, etc. as a data source:
sparkSession.read().format("csv").load("path/to/people.csv");
sparkSession.read().format("org.apache.spark.sql.cassandra").options(properties).load()
How can I read directly from the result of an HTTP call, without having to dump the data into an intermediate CSV file or intermediate database table?
For instance, the CSV or database would contain millions of rows, and once read, the job needs to perform some kind of map-reduce operation.
Now the exact same data comes from the result of an HTTP call. The payload is small enough for the network layer, but the information it contains is large, so I would like to apply the same map-reduce.
How can I read from the response of an HTTP call?
Thank you
You have two options for reading data in Spark:
Read directly to the driver and distribute to the executors (not scalable, as everything passes through the driver)
Read directly from the executors
The built-in data sources like CSV, Parquet, etc. all implement reading from the executors, so the job can scale with the data. They define how each partition of the data should be read, e.g. if we have 10 executors, how do you cut the data source into 10 sections so that each executor can read one section directly.
If you want to load from an HTTP request, you will either have to read through the driver and distribute, which may be OK if you know the data is going to be less than ~10 MB, or you will need to implement a custom data source so that each executor can read its own partition; you can read more here: https://aamargajbhiye.medium.com/speed-up-apache-spark-job-execution-using-a-custom-data-source-fd791a0fa4b0
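For the first (driver-side) option, a minimal sketch, assuming Java 11's HttpClient, a hypothetical endpoint that returns CSV text with a header line, and a hypothetical city column for the aggregation:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HttpToSpark {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("http-to-spark").getOrCreate();

        // 1. Fetch the payload on the driver (only acceptable for small responses).
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/api/people.csv")).build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // 2. Split into lines and distribute them to the executors as a Dataset<String>.
        List<String> lines = Arrays.asList(body.split("\n"));
        Dataset<String> raw = spark.createDataset(lines, Encoders.STRING());

        // 3. Let Spark's CSV reader parse the distributed lines, then run the map-reduce style work.
        Dataset<Row> people = spark.read().option("header", "true").csv(raw);
        people.groupBy("city").count().show();
    }
}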
I will finish by saying that this second option is almost certainly an anti-pattern. You will likely be much better off providing an intermediate staging store (e.g. S3/GCS), calling the server to load the data into the intermediate store, and then reading it into Spark on completion. With the second option you will likely end up putting too much load on the server, among other issues.
In previous lifetimes, I created a custom datasource. It is not the most trivial thing to do, but this GitHub repo explains it: https://github.com/jgperrin/net.jgp.books.spark.ch09.
When it comes to reading from a network stream, make sure that only one executor does it.
I have an Android app that uses a Java REST API, which gets its data from a MySQL database.
After I open the app, I download a list of items. Then I make another 5 requests to the REST API to get other connected resources:
First I download a list of shops.
Then, having these shop IDs, I make 5 HTTP requests to the REST API, which fetches (from MySQL) lists of photos (not the actual images, just URLs and IDs), ratings, opening hours, and two more things.
This makes 6 API (and database) calls, and it takes a long time.
Is there a better way to do it?
You might rewrite the multiple queries into a single JOIN and fetch them in a single network round trip.
Be sure you understand where the time is being spent and why. Your queries might be slow if you have WHERE clauses that don't use indexes. Run EXPLAIN PLAN and see if there's a TABLE SCAN being executed. If you see one, you can immediately improve performance by adding appropriate indexes on the columns in the WHERE clauses.
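As a rough illustration of the single-JOIN idea, assuming hypothetical shop, photo and rating tables keyed by shop_id (the joined rows still need to be grouped by shop on the application side):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ShopDetailsQuery {
    public static void main(String[] args) throws Exception {
        // One query replaces the separate per-resource requests.
        String sql = "SELECT s.id, s.name, p.url AS photo_url, r.stars "
                   + "FROM shop s "
                   + "LEFT JOIN photo p ON p.shop_id = s.id "
                   + "LEFT JOIN rating r ON r.shop_id = s.id";

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/shops", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                // A single round trip returns shops together with their photos and ratings.
                System.out.println(rs.getLong("id") + " " + rs.getString("name")
                        + " " + rs.getString("photo_url") + " " + rs.getInt("stars"));
            }
        }
    }
}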
Why don't you retrieve all of that data in the first call? The REST API response can be JSON or XML.
REQUEST: GET /shop/list
RESPONSE XML:
<shops>
<shop name="shop1" img="../../img1.jpg" ></shop>
<shop name="shop2" img="../../img2.jpg" ></shop>
<shop name="shop3" img="../../img3.jpg" ></shop>
</shops>
RESPONSE JSON:
[{name:"shop1",img:"../../img1.jpg"},
{name:"shop2",img:"../../img2.jpg"},
{name:"shop3",img:"../../img3.jpg"}]
You can handle these responses in your Java/Android application.
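For example, a small sketch of parsing the JSON variant with Gson (the Shop class simply mirrors the hypothetical payload above):

import com.google.gson.Gson;

public class ShopParser {
    static class Shop {
        String name;
        String img;
    }

    public static void main(String[] args) {
        String json = "[{\"name\":\"shop1\",\"img\":\"../../img1.jpg\"},"
                    + "{\"name\":\"shop2\",\"img\":\"../../img2.jpg\"}]";
        Shop[] shops = new Gson().fromJson(json, Shop[].class);   // one pass over the single response
        for (Shop shop : shops) {
            System.out.println(shop.name + " -> " + shop.img);
        }
    }
}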
I'm still fairly new to Hibernate. I am uploading an SQL script and auditing each statement into a database, so every statement will be saved as a string in the database. However, this file could contain 50,000+ statements. I've been reading up on Hibernate batching, but I'm wondering what would be the best way to design and implement this.
So far, the file uploads fine; I create a List from the statements in the script and then save each object through Hibernate individually. Obviously that is not great for performance!
I am wondering whether I should still build a gigantic List of 50,000+ objects from the script on the controller side and then pass it on to the DAO, or whether I should parse the file, say, 100 rows at a time, create a List of 100 objects, pass each list through service -> DAO, and continue like that until the end of the file.
How would the experts handle this design?
Thanks!
Take a look at Spring Batch: a job composed of two steps (file upload + data read/write) will solve your problem.
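If plain Hibernate batching is enough for you, the chunked approach you describe usually looks something like this sketch (AuditedStatement is a hypothetical mapped entity; set hibernate.jdbc.batch_size, e.g. to 100, so Hibernate can group the inserts):

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class StatementAuditDao {
    private final SessionFactory sessionFactory;

    public StatementAuditDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void saveAll(Iterable<String> statements) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        int count = 0;
        for (String sql : statements) {                  // stream statements instead of one huge List
            session.save(new AuditedStatement(sql));     // AuditedStatement is a hypothetical entity
            if (++count % 100 == 0) {
                session.flush();    // push the current batch of INSERTs to the database
                session.clear();    // detach saved entities so the session does not grow unbounded
            }
        }
        tx.commit();
        session.close();
    }
}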
I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: retrieve the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
Is doing some of the processing in the database, rather than in the application, a good idea?
When is this recommended, and when is it not?
What are the pros and cons?
Is it possible to extend the database language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and messy. My concern is that I can't do in the database what I can do in Java, but I don't know if this is true.
Just one example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?
My concern is that I can't do in the database what I can do in Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if the processing involves calling a lot of disparate SQL statements that can be combined into a stored procedure, then you should do the processing in the stored procedure and call it from your Java application. This way you avoid making several network trips to the database server.
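For instance, a minimal sketch of calling a stored procedure from Java via JDBC (the process_tokens procedure and its parameter are hypothetical):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class StoredProcCaller {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "password");
             CallableStatement stmt = conn.prepareCall("{call process_tokens(?)}")) {
            stmt.setString(1, "Proper Name");   // pass the token class to process
            stmt.execute();                     // one round trip; the heavy work runs inside the database
        }
    }
}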
I do not know what you are processing, though. Are you parsing XML data stored in your database? If so, perhaps you should use XQuery; many modern databases support it.
Just one example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not.
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then iterating through them is not a good idea. If there are parameters in the data that can go into a WHERE clause to limit how much is fetched, that is the way to go in my opinion. You will certainly need to run explains on your SQL, check that the correct indices are in place, and check the index cluster ratio and the type of index; all of that makes a difference. If you can't fully eliminate all the "improper names" in SQL, then get rid of as many as you can with SQL and process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want a separate batch application to stage the data before the web application queries it.
I hope my explanation makes sense. Please let me know if you have questions.
Directly interacting with the database for every single operation is tedious and hurts performance. There are several ways around this: you can use indexing, caching, or tools such as Hibernate, whose caching can keep data in memory so that you don't need to query the database for every operation. There are also tools such as a Lucene indexer, which are very popular and could solve the problem of hitting the database every time.
Hi friends
I am building a web crawler, and I would like to know a few things:
1) Can I use MapReduce to fetch the data from the net?
2) Can I save the fetched data to HBase?
3) Can I write an app in PHP to fetch the data from HBase? If yes, can you give me a code snippet? How can I add/view/delete data from HBase using PHP?
For your questions, yes, it can all be done. How you approach it depends on what exactly you want to achieve.
1) Your main control would need to partition the task. You would likely maintain some kind of list of addresses to crawl, possibly running sequential MapReduce jobs that each read the list in, split it between mappers which do the crawling, and write directly to HBase or another intermediary. The mappers would also probably output newly discovered URLs to crawl next, which would be filtered down to unique entries in the reduce phase, with the reducer outputting the list of things to crawl next. You'd need to maintain a list of recently crawled URLs and filter those out too, but that's not specific to MR/HBase.
2) You can use TableOutputFormat to send the output to HBase. You can also just open an HBase connection with HTable and write directly from your mapper (see the sketch after this list).
3) As TheDeveloper said, yes, with thrift. His link is good.
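To illustrate point 2, a rough sketch of a mapper that writes crawled pages straight into HBase (the table layout, column family and fetch helper are hypothetical; assumes the classic HBase 1.x MapReduce API, with the job wired to TableOutputFormat, e.g. via TableMapReduceUtil.initTableReducerJob with a null reducer):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrawlMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, Text url, Context context)
            throws IOException, InterruptedException {
        String page = fetch(url.toString());                // hypothetical HTTP fetch helper
        Put put = new Put(Bytes.toBytes(url.toString()));   // row key = URL
        put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes(page));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }

    private String fetch(String url) {
        // download the page here; omitted for brevity
        return "";
    }
}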
For question number 3, you can interact with HBase from PHP, but you need to do it via the Thrift interface. See this blog post for more info. Hope this helps.
Can be done easily via REST using Stargate.