I'm trying to build a pipeline using Apache Beam 2.16.0 for processing a large number of XML files. The average count is seventy million per 24 hours, and at peak load it can go up to half a billion.
File sizes vary from ~1 KB to 200 KB (sometimes even bigger, for example 30 MB).
Each file goes through various transformations, and the final destination is a BigQuery table for further analysis. So first I read the XML file, then deserialize it into a POJO (with the help of Jackson), and then apply all required transformations. The transformations work pretty fast; on my machine I was able to get about 40,000 transformations per second, depending on file size.
My main concern is file reading speed. I have a feeling that all reading is done by only one worker, and I don't understand how this can be parallelized. I tested on a dataset of 10k files.
A batch job on my local machine (MacBook Pro 2018: SSD, 16 GB RAM and a 6-core i7 CPU) can parse about 750 files/sec. If I run this on Dataflow using an n1-standard-4 machine, I get only about 75 files/sec. It usually doesn't scale up, but even when it does (sometimes up to 15 workers), I get only about 350 files/sec.
The streaming job is more interesting. It immediately starts with 6-7 workers, and in the UI I can see 1200-1500 elements/sec, but usually it doesn't show the speed at all, and if I select the last item on the page it shows that it has already processed 10,000 elements.
The only difference between the batch and streaming jobs is this option for FileIO:
.continuously(Duration.standardSeconds(10), Watch.Growth.never()))
Why does this make such a big difference in processing speed?
Run parameters:
--runner=DataflowRunner
--project=<...>
--inputFilePattern=gs://java/log_entry/*.xml
--workerMachineType=n1-standard-4
--tempLocation=gs://java/temp
--maxNumWorkers=100
The run region and the bucket region are the same.
Pipeline:
pipeline.apply(
        FileIO.match()
            .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
            .filepattern(options.getInputFilePattern())
            .continuously(Duration.standardSeconds(10), Watch.Growth.never()))
    .apply("xml to POJO", ParDo.of(new XmlToPojoDoFn()));
Example of an XML file:
<LogEntry>
  <EntryId>0</EntryId>
  <LogValue>Test</LogValue>
  <LogTime>12-12-2019</LogTime>
  <LogProperty>1</LogProperty>
  <LogProperty>2</LogProperty>
  <LogProperty>3</LogProperty>
  <LogProperty>4</LogProperty>
  <LogProperty>5</LogProperty>
</LogEntry>
The real-life files and project are much more complex, with lots of nested nodes and a huge number of transformation rules.
Simplified code on GitHub: https://github.com/costello-art/dataflow-file-io
It contains only the "bottleneck" part: reading files and deserializing them into POJOs.
If I can process about 750 files/sec on my machine (which is one powerful worker), then I would expect about 7,500 files/sec on 10 similar workers in Dataflow.
I tried to write some test code to check the behavior of FileIO.match and the number of workers [1].
In that code I set numWorkers to 50, but you can set whatever value you need. What I could see is that FileIO.match will find all the files that match the pattern, but after that you have to deal with the content of each file separately.
For example, in my case I created a method that receives each file and then splits its content on the newline (\n) character (but you can handle it however you want; it also depends on the type of file: CSV, XML, ...).
I then transform each line into a TableRow, the format that BigQuery understands, and output each value separately (out.output(tab)). This way Dataflow can handle the lines on different workers depending on the workload of the pipeline, for example 3,000 lines across 3 different workers, each with 1,000 lines.
At the end, since it is a batch process, Dataflow waits until all the lines are processed and then inserts them into BigQuery.
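A stripped-down sketch of that flow looks roughly like this (hedged: the BigQuery table spec and the "line" field are placeholders, and splitting on newlines only makes sense for line-oriented content):

// Read matched files, split their contents into lines, convert each line to a
// TableRow and write it to BigQuery. Table spec and field name are placeholders.
PCollection<TableRow> rows = pipeline
    .apply(FileIO.match().filepattern(options.getInputFilePattern()))
    .apply(FileIO.readMatches())
    .apply("file to rows", ParDo.of(new DoFn<FileIO.ReadableFile, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        String contents = c.element().readFullyAsUTF8String();
        for (String line : contents.split("\n")) {
          c.output(new TableRow().set("line", line));
        }
      }
    }));

rows.apply(BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")   // placeholder table spec
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));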
I hope this test code helps you with yours.
[1] https://github.com/GonzaloPF/dataflow-pipeline/blob/master/java/randomDataToBQ/src/main/fromListFilestoBQ.java
Related
I'm new to Spark (although I have Hadoop and MapReduce experience) and am trying to process a giant file with one JSON record per line. I'd like to do some transformation on each line and write an output file every n records (say, 1 million). So if there are 7.5 million records in the input file, 8 output files should be generated.
How can I do this? You may provide your answer in either Java or Scala.
Using Spark v2.1.0.
You could use something like:
val dataCount = data.count
// aim for roughly 1 million records per output file
val numPartitions = math.ceil(dataCount.toDouble / 1000000).toInt
val newData = data.coalesce(numPartitions)
newData.saveAsTextFile("output path")
I'm on my Windows gaming computer at the moment, so this code is untested and probably contains minor errors, but in general it should work.
ref: Spark: Cut down no. of output files
As a side note, while controlling your partition size isn't a bad idea, arbitrarily deciding you want 1 million records in a partition is probably not the way to go. In general, you fiddle with partition sizes to optimize your cluster utilization.
EDIT: I should note this won't guarantee you will have a million records per partition just that you should have something in that ball park.
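Since the question allows Java as well, a rough Java equivalent of the above might look like this (same caveats: an untested sketch, with the input/output paths as placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RepartitionOutput {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("repartition-output"));
    JavaRDD<String> data = sc.textFile("input path");      // placeholder: your transformed records

    long dataCount = data.count();
    // aim for roughly 1 million records per output file
    int numPartitions = (int) Math.ceil(dataCount / 1000000.0);

    JavaRDD<String> newData = data.coalesce(numPartitions);
    newData.saveAsTextFile("output path");                  // placeholder output path

    sc.stop();
  }
}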
I have an input file of about 2 GB. It contains numbers (duplicates possible) from 1 to 9999, space separated. I want to read the file in small chunks (of, say, 100000 or 20000). What approach should I take?
I am planning to process these chunks of data on different nodes in distributed fashion. I cannot use HDFS or any other file system that would chunk data automatically.
When you store that 2GB of data in the HDFS, it will be broken down into blocks. The default block size for HDFS is 64MB. You can set it to any size that you wish. For example, if you set the size to be 100MB, your data will be broken down into approximately 20 blocks.
On the other hand, when you process the data through MapReduce, you can decide how much data each mapper handles by setting the split size.
For example, if you have the 20 blocks of 100 MB in HDFS mentioned above and you do not set any split size, Hadoop will figure that out for you and assign 20 mappers. But if you specify, for example, a split size of 25 MB, then you will have 80 mappers processing your data.
It is important to note that this is just an example. In practice, a higher number of mappers does not mean a faster processing time; you'd have to look into optimisation to find the best number of splits to use.
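As a hedged illustration of the split-size knob (job setup trimmed down; the paths, the 25 MB figure, and the omitted mapper/reducer classes are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "chunked processing");
FileInputFormat.addInputPath(job, new Path("/data/numbers.txt"));   // placeholder input
FileOutputFormat.setOutputPath(job, new Path("/data/out"));         // placeholder output

// Cap each input split at 25 MB; a 2 GB input then yields roughly 80 map tasks.
FileInputFormat.setMinInputSplitSize(job, 25L * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 25L * 1024 * 1024);

// job.setMapperClass(...), job.setReducerClass(...) and so on omitted for brevity
job.waitForCompletion(true);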
Hope this helps.
I am trying to write a huge amount of data, fetched from a MySQL DB, to CSV using Super CSV. How can I simply manage the performance issue? Does Super CSV write with some limits?
Since you included almost no detail in your question about how you are approaching the problem, it's hard to make concrete recommendations. So, here's a general one:
Unless you are writing your file to a really slow medium (some old USB stick or something), the slowest step in your process should be reading the data from the database.
There are two general ways to structure your program:
The bad way: Reading all the data from the database into your application's memory first and then, in a second step, writing it all in one shot to the csv file.
The right way: "Stream" the data from the db into the csv file, i.e. write the data to the csv file as it comes in to your application (record by record or batch by batch).
The idea is to set up something usually referred to as a "pipeline". Think of it like conveyor belt construction in a factory: You have multiple steps in your process of assembling some widget. What you don't want to do is have station 1 process all widgets and have stations 2 and 3 sit idle meanwhile, and then pass the whole container of widgets to station 2 to begin work, while stations 1 and 3 sit idle and so forth. Instead, station 1 needs to send small batches (1 at a time or 10 at a time or so) of widgets that are done to station 2 immediately so that they can start working on it as soon as possible. The goal is to keep all stations as busy as possible at all times.
In your example, station 1 is mysql retrieving the records, station 2 is your application that forwards (and processes?) them, and station 3 is supercsv. So, simply make sure that supercsv can start working as soon as possible, rather than having to wait for mysql to finish the entire request.
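A rough sketch of that streaming setup with plain JDBC and Super CSV (the connection details, query and column handling below are placeholders, and the Integer.MIN_VALUE fetch size is a MySQL Connector/J-specific trick for streaming result sets):

import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;

try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
     Statement st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
     ICsvListWriter csv = new CsvListWriter(new FileWriter("export.csv"),
                                            CsvPreference.STANDARD_PREFERENCE)) {

    // Ask the MySQL driver to stream rows instead of buffering the whole result set in memory.
    st.setFetchSize(Integer.MIN_VALUE);

    try (ResultSet rs = st.executeQuery("SELECT id, name, amount FROM big_table")) {  // placeholder query
        ResultSetMetaData md = rs.getMetaData();
        int cols = md.getColumnCount();
        String[] header = new String[cols];
        for (int i = 0; i < cols; i++) {
            header[i] = md.getColumnLabel(i + 1);
        }
        csv.writeHeader(header);

        Object[] row = new Object[cols];
        while (rs.next()) {
            for (int i = 0; i < cols; i++) {
                row[i] = rs.getObject(i + 1);
            }
            csv.write(row);   // each record goes straight to the file; nothing piles up in memory
        }
    }
}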
If you do this right, you should be able to generate the csv file as quickly as mysql can throw records at you*, and then, if it's still too slow, you need to rethink your database backend.
*I haven't used supercsv yet, so I don't know how well it performs, but given how trivial its job is and how popular it is, I would find it hard to believe that it would end up performing less well (as measured in processing time needed for one record) than mysql in this task. But this might be something that is worth verifying...
Description (for reference):
I want to index an entire drive of files: ~2 TB.
I'm getting the list of files (using the Commons IO library).
Once I have the list of files, I go through each file and extract readable data from it using Apache Tika.
Once I have the data, I index it using Solr.
I'm using SolrJ in the Java application.
My question is: how do I decide what size of document collection to pass to Solr? I've tried passing in different sizes with different results, i.e. sometimes 150 documents per collection performs better than 100 documents, but sometimes it does not. Is there an optimal size / configuration that you can tweak, since this process has to be carried out repeatedly?
Complications:
1) Files are stored on a network drive, so retrieving the filenames/files takes some time too.
2) Neither this program (Java app) nor Solr itself can use more than 512 MB of RAM.
I'll name just a few of the many parameters that may affect indexing speed. Usually you need to experiment with your own hardware, RAM, data processing complexity, etc. to find the best combination, i.e. there is no single silver bullet for everyone.
Increase the number of segments allowed during indexing to some large number, say 10k. This makes sure that segment merging does not happen as often as it would with the default of 10 segments; merging segments during indexing contributes to slowing it down. You will have to merge the segments after indexing is complete for your search engine to perform well, and then lower the number of segments back to something sensible, like 10.
Reduce the logging on your container during the indexing. This can be done using the solr admin UI. This makes the process of indexing faster.
Either reduce the frequency of auto-commits or switch them off and control the committing yourself.
Remove the warmup queries for the bulk indexing, don't auto-copy any cache entries.
Use ConcurrentUpdateSolrServer, or CloudSolrServer if you are using SolrCloud.
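A hedged SolrJ sketch of that last point (the URL, field names and the Tika extraction helper are placeholders; the queue size and thread count need tuning, especially with a 512 MB heap):

import java.io.File;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Buffer up to 1000 documents and drain the queue with 4 background threads.
ConcurrentUpdateSolrServer solr =
    new ConcurrentUpdateSolrServer("http://localhost:8983/solr/files", 1000, 4);

for (File f : files) {                                 // "files" = the list built with Commons IO
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", f.getAbsolutePath());
    doc.addField("content", extractTextWithTika(f));   // hypothetical helper wrapping Apache Tika
    solr.add(doc);                                     // queued and sent in the background
}

solr.blockUntilFinished();
solr.commit();        // one explicit commit at the end instead of frequent auto-commits
solr.shutdown();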
Comment out auto-commit and tlogs and index on a single core. Use multithreading in your SolrJ client (number of threads = number of CPUs * 2) to hit that single core.
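For example, something along these lines (a sketch only; the URL, the batches list and the buildDocument helper are hypothetical, and exception handling is trimmed):

int threads = Runtime.getRuntime().availableProcessors() * 2;
ExecutorService pool = Executors.newFixedThreadPool(threads);
final SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/files");  // placeholder URL

for (final List<File> batch : batches) {              // batches = your file list split into chunks
    pool.submit(new Runnable() {
        public void run() {
            try {
                List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                for (File f : batch) {
                    docs.add(buildDocument(f));       // hypothetical helper (Tika extraction etc.)
                }
                solr.add(docs);                       // no per-batch commit
            } catch (Exception e) {
                e.printStackTrace();                  // at least log the failed batch
            }
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.DAYS);              // wrap for InterruptedException in real code
solr.commit();                                        // commit once at the end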
I challenge you :)
I have a process that someone already implemented. I will try to describe the requirements, and I was hoping I could get some input on the "best way" to do this.
It's for a financial institution.
I have a routing framework that allows me to receive files and send requests to other systems. I have a database I can use as I wish, but only I and my software have access to this database.
The facts
Via the routing framework I receive a file.
Each line in this file follows a fixed length format with the identification of a person and an amount (+ lots of other stuff).
99% of the time this file is below 100 MB (around 800 bytes per line, i.e. 2.2 MB = 2,600 lines).
Once a year we have 1-3 GB of data instead.
Running on an "appserver"
I can fork subprocesses as I like. (within reason)
I cannot ensure consistency when running for more than two days: subprocesses may die, the connection to the DB/framework might be lost, files might move.
I can NOT send reliable messages via the framework. The call is synchronous, so I must wait for the answer.
It's possible/likely that sending these getPerson requests will crash my "process" when sending LOTS of them.
We're using java.
Requirements
I must return a file with all the data, plus I must add some more info for some lines (about 25-50% of the lines: 25,000 at least).
This info I can only get by doing a getPerson request via the framework to another system, one per person. Each call takes between 200 and 400 ms.
It must be able to complete within two days
Nice to have
Checkpointing. If I'm going to run for a long time, I sure would like to be able to restart the process without starting from the top.
...
How would you design this?
I will later add the current "hack" and my brief idea
========== Current solution ================
It's running on BEA/Oracle Weblogic Integration, not by choice but by definition
When the file is received, each line is read into a database row with id, line, batchfilename and a status of 'Needs processing'.
When all lines are in the database, the rows are separated by mod 4 and a process is started for each quarter of the rows; each line that needs it is enriched by the getPerson call and its status is set to 'Processed' (38.0000 in the current batch).
When all 4 quarters of the rows have been processed, a writer process starts, selecting 100 rows at a time from the database, writing them to the file and updating their status to 'Written'.
When all is done, the new file is handed back to the routing framework, and an "I'm done" email is sent to the operations crew.
The 4 processing processes can/will fail, so it's possible to restart them with an HTTP GET to a servlet on WLI.
Simplify as much as possible.
The batches (trying to process them as units, and their various sizes) appear to be discardable in terms of the simplest process. It sounds like the rows are atomic, not the batches.
Feed all the lines as separate atomic transactions through an asynchronous FIFO message queue, with a good mechanism for detecting (and appropriately logging and routing) failures. Then you can deal with problems strictly on an exception basis. (A queue table in your database can probably work.)
Maintain batch identity only with a column in the message record, and summarize batches by that means however you need, whenever you need.
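A very rough sketch of a single worker draining such a queue table (everything here is hypothetical: the table name and columns, the dataSource, and the enrich() helper wrapping the getPerson call; for the yearly multi-GB runs you would page through the pending ids instead of loading them all):

// Hypothetical queue table: line_queue(id, line, status, result, error). Single worker shown;
// assumed to run inside a method that declares SQLException.
try (Connection con = dataSource.getConnection()) {
    con.setAutoCommit(false);

    Map<Long, String> pending = new LinkedHashMap<Long, String>();
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery("SELECT id, line FROM line_queue WHERE status = 'NEW'")) {
        while (rs.next()) pending.put(rs.getLong(1), rs.getString(2));
    }

    try (PreparedStatement mark = con.prepareStatement(
            "UPDATE line_queue SET status = ?, result = ?, error = ? WHERE id = ?")) {
        for (Map.Entry<Long, String> entry : pending.entrySet()) {
            try {
                mark.setString(1, "DONE");
                mark.setString(2, enrich(entry.getValue()));  // hypothetical: the getPerson call lives here
                mark.setString(3, null);
            } catch (Exception e) {
                mark.setString(1, "FAILED");                  // the failure is logged/routed, not fatal
                mark.setString(2, null);
                mark.setString(3, e.getMessage());
            }
            mark.setLong(4, entry.getKey());
            mark.executeUpdate();
            con.commit();   // one line = one atomic transaction; a restart simply skips non-'NEW' rows
        }
    }
}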
When you receive the file, parse it and put the information in the database.
Make one table with a record per line that will need a getPerson request.
Have one or more threads get records from this table, perform the request and put the completed record back in the table.
Once all records are processed, generate the complete file and return it.
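A hedged outline of steps 2-4 (the dao, frameworkClient, LineRecord and writeOutputFile names are hypothetical, and the thread count has to be tuned against what the remote system tolerates):

int threads = 8;   // getPerson takes 200-400 ms, so a handful of threads gives a few dozen calls/sec
ExecutorService pool = Executors.newFixedThreadPool(threads);
List<Future<?>> results = new ArrayList<Future<?>>();

for (final LineRecord rec : dao.findByStatus("NEEDS_PROCESSING")) {      // hypothetical DAO
    results.add(pool.submit(new Runnable() {
        public void run() {
            try {
                Person p = frameworkClient.getPerson(rec.getPersonId()); // synchronous 200-400 ms call
                dao.saveEnriched(rec.getId(), p);                        // also flips status to 'PROCESSED'
            } catch (Exception e) {
                dao.markFailed(rec.getId(), e.getMessage());             // checkpointed; a rerun retries it
            }
        }
    }));
}

for (Future<?> f : results) f.get();   // wait for everything (handle InterruptedException/ExecutionException)
pool.shutdown();

writeOutputFile(dao.findAllInOriginalOrder());   // hypothetical: stream the rows back out to the result file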
If the processing of the file takes 2 days, then I would start by implementing some sort of resume feature. Split the large file into smaller ones and process them one by one. If for some reason the whole processing is interrupted, then you will not have to start all over again.
By splitting the larger file into smaller files then you could also use more servers to process the files.
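For example, a small splitter along these lines would give you resumable, distributable pieces (a sketch; the paths and chunk size are placeholders, and it is assumed to live in a method that declares IOException):

// Split big-input.txt into chunk-0.txt, chunk-1.txt, ... of at most 100000 lines each.
int maxLines = 100000;
int chunk = 0;
int count = 0;
try (BufferedReader in = new BufferedReader(new FileReader("big-input.txt"))) {
    BufferedWriter out = new BufferedWriter(new FileWriter("chunk-" + chunk + ".txt"));
    String line;
    while ((line = in.readLine()) != null) {
        if (count == maxLines) {
            out.close();
            chunk++;
            count = 0;
            out = new BufferedWriter(new FileWriter("chunk-" + chunk + ".txt"));
        }
        out.write(line);
        out.newLine();
        count++;
    }
    out.close();
}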
You could also use a bulk loader (Oracle's SQL*Loader, for example) to load the large amount of data from the file into a table, again adding a column to mark whether a line has been processed, so you can pick up where you left off if the process crashes.
The return value could be many small files which at the end would be combined into a single large file. If the database approach is chosen, you could also save the results in a table, which could then be extracted to a CSV file.