reading CSV file from s3 using spark - java

I am new to Spark. I have a scenario where I need to read and process a CSV file from AWS S3. This file is generated on a daily basis, so I need to read and process it and dump the data into Postgres.
I want to process this huge file in parallel to save time and memory.
I came up with two designs, but I am a little bit confused about Spark, as the Spark context requires a connection to be open to the S3 bucket.
Use Spark Streaming to read the CSV from S3, process it and convert it to JSON row by row, and append the JSON data to a JSONB column in Postgres.
Use Spring and Java: download the file onto the server, then start processing it and convert it into JSON.
Can anyone point me in the right direction?

If it's daily, and only 100MB, you don't really need much in the way of large-scale tooling. I'd estimate under a minute for the basic download and processing, even remotely, after which comes the Postgres load, which Postgres itself handles well with its bulk-load support (COPY).
Try doing this locally first: use aws s3 cp to copy the file to your local system, then try the Postgres load.
I wouldn't bother with any parallel tooling; even Spark is going to want to work with 32-64MB blocks, so you won't get more than 2-3 workers. And if the file is .gz, you get exactly one.
That said, if you want to learn Spark, you could do this in spark-shell. Download the file locally first though, just to save time and money.
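If you do go the Spark route, a minimal Java sketch of the read-from-S3, write-to-Postgres path could look like the following. The bucket name, file path, table name, credentials and the s3a settings are all placeholder assumptions, and the JDBC write assumes the Postgres driver and hadoop-aws are on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DailyCsvToPostgres {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-csv-to-postgres")
                .getOrCreate();

        // Placeholder S3 credentials; in practice prefer instance roles or a credentials provider.
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        // Read the daily CSV straight from S3 (header row assumed).
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("s3a://your-bucket/daily/2024-01-01.csv");

        // Append into Postgres over JDBC; table and connection details are assumptions.
        df.write()
                .mode(SaveMode.Append)
                .format("jdbc")
                .option("url", "jdbc:postgresql://dbhost:5432/yourdb")
                .option("dbtable", "daily_data")
                .option("user", "dbuser")
                .option("password", "dbpassword")
                .save();

        spark.stop();
    }
}
```

If you really want per-row JSON for a JSONB column, functions.to_json(functions.struct(...)) can turn each row into a JSON string column before the JDBC write.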

Related

Azure Functions and temporary File Storage

I'm a beginner and have never dealt with cloud-based solutions before, so apologies for the dumb question.
I have an Azure Blob Storage container with PDF files from which I want to extract data using PDFBox. Because PDFBox can't load blobs directly, I currently download these files locally first. However, eventually my project will need to become fully cloud-based, preferably as an Azure Function.
The main hurdle therefore is figuring out how my Azure Function should access the files. When using the console inside my Azure Function I noticed it comes with file storage. Can the Function download blobs and store them there before processing them? Does this file storage work the same as a local environment, or are there differences to keep in mind?
I'm only looking to store files temporarily here, for only a few minutes at a time.
The main hurdle therefore is figuring out how my Azure Function should access the files. When using the console inside my Azure Function I noticed it comes with a file storage.
Yes, all of the information of your deployed Azure Function is stored in the file storage you set (it is defined when you create the Function App).
Can the Function download blobs and store them here before processing it? Does this file storage work the same as a local environment or are there differences to keep in mind?
Yes, you can. The root directory is D:/home/site/wwwroot, so if you don't specify a path, any file you create will end up in this directory.
Remember to delete the files, because the storage space is limited; how much you get depends on the plan you selected.
I'm only looking to store files temporarily here, for only a few minutes at a time.
By the way, once you fetch a file from blob storage you already have its complete data in memory. You can process the retrieved data directly in your code without temporarily storing it in a local folder. (Of course, if you have special needs, ignore this suggestion.)
You can use a blob trigger or input binding to load a blob into memory of your function for processing by PDFBox.
With regards to the local file system, you can read more about it here. From the description of your problem, I think a blob trigger or input binding should be sufficient for you.
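As a rough illustration of the blob-trigger approach, a Java function along these lines could hand the blob bytes straight to PDFBox without touching the local file system. The container path "pdfs/{name}" and the connection setting name are assumptions, and PDDocument.load(byte[]) is the PDFBox 2.x API:

```java
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.BindingName;
import com.microsoft.azure.functions.annotation.BlobTrigger;
import com.microsoft.azure.functions.annotation.FunctionName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfExtractFunction {

    // Fires whenever a new PDF lands in the (assumed) "pdfs" container.
    @FunctionName("ExtractPdfText")
    public void run(
            @BlobTrigger(name = "content",
                         path = "pdfs/{name}",
                         dataType = "binary",
                         connection = "AzureWebJobsStorage") byte[] content,
            @BindingName("name") String name,
            final ExecutionContext context) throws Exception {

        // PDFBox can load the blob directly from the in-memory byte array.
        try (PDDocument document = PDDocument.load(content)) {
            String text = new PDFTextStripper().getText(document);
            context.getLogger().info("Extracted " + text.length() + " characters from " + name);
            // ... process the extracted text further here ...
        }
    }
}
```

This avoids the temporary-file question entirely, since the blob content never needs to be written to D:/home/site/wwwroot at all.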

What to use : Excel VS MySql/MongoDB [Java]

I am going to make a business application for my father to make GST (Goods and Services Tax) filing easier. I have the design ready and I am going to use JavaFX.
The user will enter the data in a TableView, and that data needs to be stored for future reference.
The TableView needs to be converted to an Excel file (going to use Apache POI). The Excel file will be sent to a C.A. who will file GST on my father's behalf.
The application will need to import/export data into/from the TableView and edit the data as necessary.
I have two options:
Store/retrieve data from MySQL into the TableView, update it according to the user's will, and later export the data into Excel files to send to the C.A.
Store/retrieve data from Excel files into the TableView, update it according to the user's will, and send the Excel file to the C.A.
I am planning to expand the application into complete business software that can manage the entire business.
What should I use?
Which one will be more efficient and why?
I hope I am able to convey my question (I ain't good at writing).
In my opinion MySQL is more efficient and gives you more possibilities for exploiting the data, because reading and writing an Excel file takes a lot of time and is slower.
I'll answer my own question, since I have got the answer.
I'm going with SQLite for now, as using CSV or Excel files is going to consume a lot of resources (I tried it).
I am going to sync the .db file to Drive using scripts run from the application itself. MySQL is definitely the better choice, but I want the database to be used by two computers at a time (not on a network), so I would have to pay for a hosted database.
I will store the .db file on Drive and retrieve it whenever the application runs. This way it is going to be safe.
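For anyone following the same route, a minimal sketch of the SQLite side with the sqlite-jdbc driver might look like this. The table name, columns and file path are assumptions; the TableView wiring and the POI export are left out:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class GstStore {
    // Local .db file; this is the file that would later be synced to Drive.
    private static final String URL = "jdbc:sqlite:gst-data.db";

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL)) {
            // Create the (assumed) invoices table if it doesn't exist yet.
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS invoices ("
                        + "id INTEGER PRIMARY KEY AUTOINCREMENT, "
                        + "party TEXT, amount REAL, gst_rate REAL)");
            }

            // Insert a row, e.g. one entry coming from the TableView.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO invoices(party, amount, gst_rate) VALUES (?, ?, ?)")) {
                ps.setString(1, "Example Traders");
                ps.setDouble(2, 12500.0);
                ps.setDouble(3, 18.0);
                ps.executeUpdate();
            }

            // Read everything back, e.g. to repopulate the TableView on startup.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT party, amount, gst_rate FROM invoices")) {
                while (rs.next()) {
                    System.out.printf("%s  %.2f  %.1f%%%n",
                            rs.getString("party"), rs.getDouble("amount"), rs.getDouble("gst_rate"));
                }
            }
        }
    }
}
```

The Excel file for the C.A. can then be generated from the query results with Apache POI as a pure export step, so the spreadsheet never has to act as the system of record.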

Writing mapreduce output directly onto a webpage

I have a MapReduce job which writes its output to a file in HDFS. But instead of writing it to HDFS, I want the output to be written directly to a webpage. I have created a web project in Eclipse and written the driver, mapper and reducer classes in it. When I run it with the Tomcat server, it doesn't work.
So how can the output be displayed on a webpage?
If you are using the MapR distribution, you can write the output of your MapReduce job to the regular file system (not HDFS), but fixing your issue will require more info.
HDFS (by itself) is not really designed for low-latency random reads/writes. A couple of options you do have, however, are WebHDFS / HttpFS, which expose a REST API to HDFS: http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html and http://hadoop.apache.org/docs/r2.4.1/hadoop-hdfs-httpfs/. You could have the webserver pull whatever file you want and serve it on the webpage. I don't think this is a very good solution, however.
A better solution might be to have MapReduce output to HBase (http://hbase.apache.org/) and have your webserver pull from HBase. It is far better suited for low-latency random read / writes.
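A hedged sketch of that HBase route: route the reduce output into HBase as Puts via TableMapReduceUtil, and have the web app query the table by row key. The table name "page_data", column family "cf", and the Text/IntWritable key-value types are assumptions; your existing mapper stays as it is (HBase 1.x/2.x client API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class MapReduceToHBase {

    // Reducer that writes each aggregated result as a Put into HBase
    // instead of a file in HDFS.
    public static class HBaseWriterReducer
            extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            Put put = new Put(Bytes.toBytes(key.toString()));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(sum));
            context.write(new ImmutableBytesWritable(Bytes.toBytes(key.toString())), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "mr-to-hbase");
        job.setJarByClass(MapReduceToHBase.class);
        // job.setMapperClass(YourMapper.class);  // plug in your existing mapper here
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Route the reduce output to the (assumed) "page_data" HBase table.
        TableMapReduceUtil.initTableReducerJob("page_data", HBaseWriterReducer.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The Tomcat side then just reads the relevant rows from HBase on each request, which keeps the web layer completely decoupled from the batch job.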

Attachments under File system vs databases?

I need to store attachments on the server side. I can store them either in a BLOB column of a database or in a file directory.
My question is which one is more reliable, scalable and maintainable?
EDIT:
If we go for the file system, we have to handle synchronization ourselves, don't we? For example, if two users are trying to create/update a file under the same directory, how will we handle concurrency with the filesystem?
Storing the files in a directory is more reliable and keeps indexing, fetching and other DB operations fast: just store the path of the file in the DB and store the file itself in the directory.
When a lot of store requests hit the server, it is very hard and complex to handle that many requests against the database itself.
So it is better to store the data in a directory, because accessing the data becomes faster; and as the daily volume of DB storage grows this matters even more, so when you start any system, study it well first and then decide which technique will be best.
When there is more data in the DB, clustering and indexing become more important.
If you only need small data storage then a BLOB is a good option, but for large data I would not recommend it: I built an online data-store web application, faced this situation, and in the end stored the data in a directory with just the path in the DB.
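On the concurrency question from the edit, one simple pattern is to never reuse filenames: give each upload a unique name on disk and keep the original name plus the stored path in the DB, so two users writing "invoice.pdf" at the same time never touch the same file. A rough Java sketch, with the directory, table and column names as assumptions:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;

public class AttachmentStore {

    // Assumed base location for attachments on the server.
    private static final Path BASE_DIR = Paths.get("/var/app/attachments");

    // Saves the uploaded stream under a unique name, then records the path in the DB.
    public static void save(Connection conn, String originalName, InputStream data) throws Exception {
        Files.createDirectories(BASE_DIR);

        // UUID prefix guarantees a collision-free filename per upload.
        String storedName = UUID.randomUUID() + "_" + originalName;
        Path target = BASE_DIR.resolve(storedName);
        Files.copy(data, target, StandardCopyOption.REPLACE_EXISTING);

        // attachments(original_name, stored_path) is an assumed table.
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO attachments(original_name, stored_path) VALUES (?, ?)")) {
            ps.setString(1, originalName);
            ps.setString(2, target.toString());
            ps.executeUpdate();
        }
    }
}
```

Updates can follow the same write-new-file-then-update-the-path pattern, letting the database transaction decide which version "wins" instead of relying on filesystem locking.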

Large Excel File - Upload and Import - Java

I want to upload and import a large Excel file, with more than one million records, into my Java program.
I can easily import small files using Apache POI, but when I start with large files the application throws an out-of-memory error.
I searched Google and found many threads on SO and tried everything, but I could not get around it.
Can anybody give me a solution for my particular problem? Import time is not an issue for me, and right now I can live with a performance hit as well; I just want to import this data into my existing system without an OOM error.
I have a very good configuration on my system and Java has enough memory to use, so hardware is not an issue.
You'll want to stream the data so that you don't need to store all the records in memory at once. POI does support streaming (see XSSF and SAX event API). As you read the data, ship it off to wherever you need to (database or wherever, you did not specify) -- with the streaming API you should not read all the data into memory before processing it.
You could also export the data to a CSV file, and then use a regular FileInputStream to read the file and process each record as it is read.
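For the streaming route, a rough sketch using POI's XSSF and SAX event API (class names as in POI 4.1+). The file path is a placeholder, and the row handler just prints values where you would instead write each record to your database:

```java
import java.io.InputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.util.XMLHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class LargeXlsxReader {

    // Handler that receives one cell at a time instead of loading the workbook into memory.
    static class RowPrinter implements SheetContentsHandler {
        @Override public void startRow(int rowNum) { System.out.print("Row " + rowNum + ": "); }
        @Override public void endRow(int rowNum) { System.out.println(); }
        @Override public void cell(String cellRef, String value, XSSFComment comment) {
            System.out.print(value + " | ");   // replace with your own per-record processing
        }
    }

    public static void main(String[] args) throws Exception {
        // "big.xlsx" is a placeholder path for the large workbook.
        try (OPCPackage pkg = OPCPackage.open("big.xlsx")) {
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XSSFReader reader = new XSSFReader(pkg);

            XMLReader parser = XMLHelper.newXMLReader();
            parser.setContentHandler(new XSSFSheetXMLHandler(
                    reader.getStylesTable(), strings, new RowPrinter(), false));

            // Stream each sheet's XML through the SAX handler, row by row.
            XSSFReader.SheetIterator sheets = (XSSFReader.SheetIterator) reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheet = sheets.next()) {
                    parser.parse(new InputSource(sheet));
                }
            }
        }
    }
}
```

Because only one row is in memory at a time, the heap usage stays flat regardless of how many million records the workbook holds.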
