Hi friends
I am building a web crawler and would like to know a few things about that:
1) Can I use MapReduce to fetch the data from the net?
2) Can I save the fetched data to HBase?
3) Can I write an app in PHP to fetch data from HBase? If yes, can you give me a code snippet? How can I add/view/delete data in HBase using PHP?
For your questions, yes, it can all be done. How you approach it depends on what exactly you want to achieve.
1) Your main control process would need to partition the task. You would likely maintain some kind of list of URLs to crawl, possibly running sequential MapReduce jobs that each read the list in, split it between mappers that do the crawling, and write directly to HBase or another intermediary. The mappers would also output newly discovered URLs, which in turn would be filtered down to uniques in the reduce phase, with the reducer outputting the list of things to crawl next. You'd also need to maintain a list of recently crawled URLs and filter those out too, but that's not specific to MR/HBase.
2) You can use TableOutputFormat to send the output to HBase. You can also just open an HBase connection with HTable and write directly from your mapper.
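A rough sketch of 2) as a map-only job that writes Puts through TableOutputFormat; the table name ("crawled_pages"), column family ("content") and the fetch() stub are assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CrawlToHBase {

    // Each input line is a URL; the mapper fetches the page and emits a Put keyed by the URL.
    public static class CrawlMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text url, Context ctx)
                throws IOException, InterruptedException {
            String html = fetch(url.toString());                    // your HTTP fetching logic
            Put put = new Put(Bytes.toBytes(url.toString()));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes(html));
            ctx.write(new ImmutableBytesWritable(put.getRow()), put);
        }

        private String fetch(String url) { /* HTTP GET elided */ return ""; }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableOutputFormat.OUTPUT_TABLE, "crawled_pages");  // assumed table name
        Job job = Job.getInstance(conf, "crawl-to-hbase");
        job.setJarByClass(CrawlToHBase.class);
        job.setMapperClass(CrawlMapper.class);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        job.setNumReduceTasks(0);                                   // map-only: write straight from the mapper
        // FileInputFormat.addInputPath(job, new Path(args[0]));    // the list of URLs to crawl
        job.waitForCompletion(true);
    }
}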
3) As TheDeveloper said, yes, with Thrift. His link is good.
For question number 3: you can interact with HBase from PHP, but you need to do it via the Thrift interface. See this blog post for more info. Hope this helps.
Can be done easily via REST using Stargate.
Small question regarding Spark and how to read from the result of an HTTP response, please.
It is well known that Spark can take a database, a CSV file, etc. as a data source:
sparkSession.read().format("csv").load("path/to/people.csv");
sparkSession.read().format("org.apache.spark.sql.cassandra").options(properties).load()
May I ask how to read directly from the result of an HTTP call, without having to dump the data into an intermediate CSV file or database table?
For instance, the CSV file or the database would contain millions of rows, and once read, the job needs to perform some kind of map-reduce operation.
Now, the exact same data comes from the result of an HTTP call. It is small enough for the network layer, but the information contained in the payload is big, so I would like to apply the same map-reduce.
How can I read from the response of an HTTP call?
Thank you
You have two options for reading data in Spark:
Read directly to the driver and distribute to the executors (not scalable, as everything passes through the driver)
Read directly from the executors
The built-in data sources like CSV, Parquet, etc. all implement reading from the executors, so the job can scale with the data. They define how each partition of the data should be read, e.g. if we have 10 executors, how do you cut the data source into 10 sections so that each executor can read one section directly?
If you want to load from an HTTP request, you will either have to read through the driver and distribute, which may be OK if you know the data is going to be less than ~10 MB, or you would need to implement a custom data source so that each executor can read its own partition; you can read more here: https://aamargajbhiye.medium.com/speed-up-apache-spark-job-execution-using-a-custom-data-source-fd791a0fa4b0
I will finish by saying that this second option is almost definitely an anti-pattern. You will likely be much better off providing an intermediate staging area (e.g. S3/GCS), calling the server to load the data into the intermediate store, and then reading it into Spark on completion. In the second scenario, you will likely end up putting too much load on the server, amongst other issues.
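For completeness, a minimal sketch of the first option (fetch the payload on the driver, then hand it to Spark). The URL, the assumption that the payload is CSV with a header, and the use of Java 11's HttpClient are all illustrative:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HttpToSpark {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("http-to-spark").getOrCreate();

        // 1. Fetch the whole payload on the driver (fine for small responses).
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("https://example.com/data.csv")).build(),
                HttpResponse.BodyHandlers.ofString());

        // 2. Turn the body into a Dataset<String> and let Spark parse it as CSV.
        List<String> lines = Arrays.asList(resp.body().split("\n"));
        Dataset<String> raw = spark.createDataset(lines, Encoders.STRING());
        Dataset<Row> df = spark.read().option("header", "true").csv(raw);

        // 3. From here, the usual map/reduce-style transformations apply ("someColumn" is a placeholder).
        df.groupBy("someColumn").count().show();
    }
}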
In previous lifetimes, I created a custom datasource. It is not the most trivial thing to do, but this GitHub repo explains it: https://github.com/jgperrin/net.jgp.books.spark.ch09.
When it comes to reading from a network stream, make sure that only one executor does it.
I am trying to do the following with high performance. I want to process one huge file with the following requirements. The file has around 20,00,000 (2 million) records in the following format.
Format and Sample Data
STUDENT_NAME|STUDENT_ID|MOBILE_NUMBER|EMAIL_ID
Ramesh |12345 |928xxxxx |test@test.com
I have some business logic to apply during this operation. As with the sample record above, I will have around 20,00,000 records in one file. During my business operation I will update some of these records, but I don't want to use any for-loop kind of thing; I should be able to update values randomly.
Technology I am using
Camel & Java Spring
Please suggest the best approach for doing this.
Thanks
I am creating reports on MongoDB using Java, and I need to use map-reduce to create the reports. I have 3 replicas in production. For report queries I do not want to make requests to the primary MongoDB database; I want to query only a secondary replica. But if we use map-reduce, it will create a temporary collection.
1) Is there any problem if I set the read preference to secondary for reports using map-reduce?
2) Will it create the temporary collection on the secondary replica?
3) Is there any other way to use the secondary replica for reporting, since I do not want to create traffic on the primary database?
4) Will I get the correct results, given that I have huge data?
Probably the easiest way to do this is to just connect to the secondary directly, instead of connecting to the replica set with ReadPreference.SECONDARY_ONLY. In that case, it will definitely create the temporary collection on the secondary, and you should get the correct results (of course!).
I would also advise you to look at the Aggregation Framework, though, as it's a lot faster and often easier to use and debug than map-reduce jobs. It's not as powerful, but I have yet to find a situation where I couldn't use the Aggregation Framework for my aggregation and reporting needs.
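As a hedged sketch with the current MongoDB Java driver (database, collection and field names are made up for illustration), a report can be run entirely against a secondary, and unlike map-reduce the aggregation needs no temporary collection:

import java.util.Arrays;
import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class SecondaryReport {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://host1,host2,host3/?replicaSet=rs0")) {
            MongoCollection<Document> events = client.getDatabase("reports")
                    .getCollection("events")
                    .withReadPreference(ReadPreference.secondary());   // read only from secondaries

            // Count page views per user without touching the primary.
            events.aggregate(Arrays.asList(
                    Aggregates.match(Filters.eq("type", "pageview")),
                    Aggregates.group("$userId", Accumulators.sum("total", 1))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}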
I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: retrieve the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
Is doing some processing in the database, and not in the application, a good idea?
When is this recommended, and when not?
Are there pros and cons?
Is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and dirty. My concern is that I can't do in the database what I can do in Java, but I don't know if this is true.
Only an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?
"My concern is that I can't do in the database what I can do in Java, but I don't know if this is true."
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if it involves calling a lot of disparate SQL statements that can be combined into a stored procedure, then you should do the processing in the stored procedure and call the stored proc from your Java application. This way you avoid making several network trips to the database server.
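For example, a single JDBC call to a stored procedure replaces many individual round trips; the connection URL, procedure name and parameter below are invented for illustration:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class StoredProcCall {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass");
             CallableStatement cs = con.prepareCall("{call classify_tokens(?)}")) {
            cs.setInt(1, 1000);   // e.g. a batch size handled inside the procedure
            cs.execute();         // one network trip; all the SQL work happens in the database
        }
    }
}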
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery; a lot of modern databases support it.
"Only an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not."
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then going through them is not a good idea. If there are certain properties of the data that can be put into a WHERE clause to limit the number of rows being fetched, that is the way to go in my opinion. You will surely need to run explain plans on your SQL, check that the correct indices are in place, check the index cluster ratio and the type of index; all of that will make a difference. Now, if you can't fully eliminate all "improper names", then you should try to get rid of as many as you can with SQL and then process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want to create a batch application to stage the data before the web application queries it.
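As a rough sketch of that idea (table name, column names and connection details are assumptions), push the filtering into the WHERE clause and stream the remaining rows with a modest fetch size instead of loading everything at once:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TokenBatch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/corpus", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, word FROM token WHERE token_class = 'PROPER_NAME' ORDER BY id")) {
            con.setAutoCommit(false);   // some drivers (e.g. PostgreSQL) need this for cursor-based fetching
            ps.setFetchSize(5_000);     // stream in chunks instead of materialising 10 million rows
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    processToken(rs.getLong("id"), rs.getString("word"));
                }
            }
        }
    }

    private static void processToken(long id, String word) { /* business logic goes here */ }
}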
I hope my explanation makes sense. Please let me know if you have questions.
Directly interacting with the DB for every single thing is tedious and affects performance. There are several ways to get around this: you can use indexing, caching, or tools such as Hibernate, whose caching can keep data in memory so that you don't need to query the DB for every operation. There are also tools such as a Lucene indexer, which are very popular and could solve your problem of hitting the DB every time.
With Java, how can I store around a billion key-value pairs in a file, with the possibility of dynamically updating and querying the values whenever necessary?
If for some reason a database is out of the question, then you need to answer the following question about your problem:
What is the mix of the following operations?
Insert
Read
Modify
Delete
Search
Once you have a good guess at the ratio of these operations, try selecting the appropriate data structure for use in your file. I'd recommend starting with this book as a good catalog of options:
http://www.amazon.com/Introduction-Algorithms-Second-Thomas-Cormen/dp/0262032937
You'll want to select a data structure with the best average and worst case runtimes for your most common operations.
Good Luck
Old question, but this is a case for log files. You do not want to be copying a billion records over every time you do a delete. This can be solved by logging all "transactions" or updates to a new and separate file. These files should be broken up into reasonable sizes.
To read a tuple, you start at the newest log file and scan backwards until you find your key, then stop. To update or insert, you just add a new record to the most recent log file. A delete is still just a log entry.
A batch coalesce process needs to run periodically; it scans each log file and writes out a new master. As the logs are read, each NEW key gets written to the new master and duplicate (old) keys are skipped, until you make it all the way through. If you encounter a delete record, mark it in a separate delete list, skip the record, and ignore subsequent records with that key.
That makes it sound simple, but remember you may want to block/chunk your files, as you will likely scan the log files in reverse, or at least seek() to the maximum size and work backwards rather than reading forwards.
I have done this exact thing with billions of lines of data. You're just re-inventing sequential access databases.
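A deliberately simplified sketch of that idea; the tab-separated record format and the tombstone marker are invented, and a real implementation would split the log into size-limited files and seek from the end rather than reading the whole file:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Optional;

public class LogKeyValueStore {
    private final Path log;

    public LogKeyValueStore(Path log) { this.log = log; }

    // Insert and update are both appends.
    public void put(String key, String value) throws IOException {
        Files.write(log, List.of(key + "\t" + value), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // A delete is still a log entry: append a tombstone.
    public void delete(String key) throws IOException {
        put(key, "__DELETED__");
    }

    // Newest entry wins: scan from the end of the log back towards the start.
    public Optional<String> get(String key) throws IOException {
        List<String> lines = Files.readAllLines(log, StandardCharsets.UTF_8);
        for (int i = lines.size() - 1; i >= 0; i--) {
            String[] parts = lines.get(i).split("\t", 2);
            if (parts[0].equals(key)) {
                return "__DELETED__".equals(parts[1]) ? Optional.empty() : Optional.of(parts[1]);
            }
        }
        return Optional.empty();
    }
}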
You leave out a lot of details, but...
Are the keys static? What about the values? Are they fixed size? Why not use a database?
If you don't want to use a database, then use a memory-mapped file.
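A small sketch of the memory-mapped-file approach, assuming fixed-size records; the 16-byte layout (8-byte key, 8-byte value) and the file name are purely illustrative:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStore {
    private static final int RECORD_SIZE = 16;   // 8-byte key + 8-byte value

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("store.bin", "rw");
             FileChannel channel = raf.getChannel()) {
            // Map the first 1 GB; a billion records would need several mapped regions.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1L << 30);

            // Write record #42 in place, then read the value back.
            buf.putLong(42 * RECORD_SIZE, 42L);        // key
            buf.putLong(42 * RECORD_SIZE + 8, 9001L);  // value
            System.out.println("value = " + buf.getLong(42 * RECORD_SIZE + 8));
        }
    }
}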
Can you use a database? Managing such a large file would be a pain.
Edit: if the file requirement is mostly about avoiding machine communication failures, downtime, and similar situations, maybe you could use an embedded database. This way you would be freed from the large-file manipulation problems and still get all the advantages a database can give you. I have used Apache Derby as an embedded database with wonderful results. Java DB is Oracle-supported and based on Derby.
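A minimal sketch of the embedded route with Derby; the table layout is illustrative, and it needs the Derby jar on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EmbeddedDerbyStore {
    public static void main(String[] args) throws Exception {
        // ";create=true" creates the database directory on first use.
        try (Connection con = DriverManager.getConnection("jdbc:derby:kvstore;create=true")) {
            try (PreparedStatement create = con.prepareStatement(
                    "CREATE TABLE kv (k VARCHAR(255) PRIMARY KEY, v VARCHAR(1024))")) {
                create.execute();   // run once; on later runs the table already exists
            }
            try (PreparedStatement insert = con.prepareStatement("INSERT INTO kv VALUES (?, ?)")) {
                insert.setString(1, "hello");
                insert.setString(2, "world");
                insert.executeUpdate();
            }
            try (PreparedStatement query = con.prepareStatement("SELECT v FROM kv WHERE k = ?")) {
                query.setString(1, "hello");
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) System.out.println(rs.getString(1));
                }
            }
        }
    }
}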