PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase

As one of the few closed-loop payment platforms, PayPal is uniquely positioned to provide merchants with insights that help them identify opportunities to grow and manage their business. PayPal processes billions of data events every day covering our users, risk, payments, web behavior and identity. We are motivated to use this data to build solutions that help our merchants maximize the number of successful transactions (checkout conversion), better understand who their customers are, and find additional opportunities to grow and attract new customers.

As part of Merchant Data Analytics, we have built a platform that serves low-latency, scalable analytics and insights by leveraging established and emerging platforms to best realize returns on the many business objectives at PayPal.

Join us to learn more about how we leveraged platforms and technologies like Spark, Hive, Druid, Elasticsearch and HBase to process large-scale data and enable impactful merchant solutions. We’ll share the architecture of our data pipelines, some real dashboards and the challenges involved.

Today PayPal is much more than a button on a website. We have an extensive portfolio of products and services. Cross-border trade (CBT), easy mobile and web access, credit options for customers, marketing solutions for merchants and many more help merchants grow their business and give customers access to safe digital commerce.

All of these also translate into a rich set of data that PayPal can use to inform strategic and operational decisions.

Concurrent Mark Sweep (CMS): if it doesn’t finish garbage collection in time, the JVM falls back to a stop-the-world GC. We tuned CMSMaxAbortablePrecleanTime from 10 seconds to 30 seconds.
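As a rough illustration of the tuning described above, the sketch below builds the JVM options string and passes it to Spark executors. The notes do not say which JVM was tuned, so applying it through `spark.executor.extraJavaOptions` is an assumption, and the helper names are hypothetical. The HotSpot flag takes its value in milliseconds, so 30 seconds is written as 30000.

```python
def cms_java_options(max_abortable_preclean_ms=30_000):
    """Build JVM flags that enable CMS and raise the abortable preclean limit.

    The HotSpot flag is expressed in milliseconds (30 s -> 30000).
    """
    flags = [
        "-XX:+UseConcMarkSweepGC",
        f"-XX:CMSMaxAbortablePrecleanTime={max_abortable_preclean_ms}",
    ]
    return " ".join(flags)


def spark_submit_conf_args(java_options):
    """Return spark-submit arguments that hand the flags to executor JVMs.

    Assumption: the tuned JVM is a Spark executor; the notes do not say.
    """
    return ["--conf", f"spark.executor.extraJavaOptions={java_options}"]


args = spark_submit_conf_args(cms_java_options())
print(" ".join(args))
```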

https://community.hortonworks.com/questions/44950/spark-memory-issue.html org.apache.spark.shuffle.MetadataFetchFailedException: we were running this job with 4 cores and 200 executors. There can be multiple reasons for the delay, such as skew in the data. For us it turned out that the datanode the executor was running on was busy; this often happened on nodes with limited capacity. Having more tasks per executor theoretically puts more pressure on the executor, and when memory is constrained the chance of an executor failure increases. MetadataFetchFailedException usually happens because an executor failed or was terminated.
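The pressure argument above can be made concrete with some back-of-the-envelope arithmetic. This is only a sketch: it reads "4 cores" as 4 cores per executor, the 16 GB executor size is a made-up figure (the notes give none), and it ignores Spark's internal memory fractions and overhead.

```python
def task_slots(num_executors, cores_per_executor):
    """Number of tasks that can run concurrently across the job."""
    return num_executors * cores_per_executor


def memory_per_task_gb(executor_memory_gb, cores_per_executor):
    """Rough share of executor heap available to each concurrent task."""
    return executor_memory_gb / cores_per_executor


# 200 executors with 4 cores each, as in the notes (per-executor reading assumed).
slots = task_slots(200, 4)

# Hypothetical 16 GB executors: fewer cores per executor leaves more memory per task.
per_task_4_cores = memory_per_task_gb(16, 4)
per_task_2_cores = memory_per_task_gb(16, 2)

print(slots, per_task_4_cores, per_task_2_cores)
```

With the executor size held fixed, halving the cores per executor doubles the memory each running task can draw on, which is one way to reduce the executor failures that surface as MetadataFetchFailedException.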

The tool was completely customizable for each project. It was built to be agnostic to the table schema and scalable enough to run on large datasets. It generated a report of match/mismatch counts by key columns such as Product and Geography, as needed.
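The following is a minimal sketch of that kind of comparison, not the tool itself. It uses plain Python dictionaries in place of tables; the function name, the `id` join column and the sample rows are all hypothetical. It stays schema agnostic by comparing whatever columns each source row carries instead of naming them.

```python
from collections import Counter


def compare_tables(source_rows, target_rows, id_col, group_cols):
    """Count matching and mismatching source rows per group of key columns.

    Columns are discovered from each row, so no schema is hard-coded.
    Rows that exist only in the target are not counted in this sketch.
    """
    target_by_id = {row[id_col]: row for row in target_rows}
    report = Counter()
    for row in source_rows:
        group = tuple(row[col] for col in group_cols)
        other = target_by_id.get(row[id_col])
        # A row matches only if the target has it and every column value agrees.
        is_match = other is not None and all(
            col in other and other[col] == value for col, value in row.items()
        )
        report[group + ("match" if is_match else "mismatch",)] += 1
    return report


source = [
    {"id": 1, "product": "Checkout", "geo": "US", "amount": 10},
    {"id": 2, "product": "Checkout", "geo": "US", "amount": 20},
    {"id": 3, "product": "Credit", "geo": "UK", "amount": 5},
]
target = [
    {"id": 1, "product": "Checkout", "geo": "US", "amount": 10},
    {"id": 2, "product": "Checkout", "geo": "US", "amount": 25},
    {"id": 3, "product": "Credit", "geo": "UK", "amount": 5},
]

report = compare_tables(source, target, id_col="id", group_cols=["product", "geo"])
for key, count in sorted(report.items()):
    print(key, count)
```

On a cluster the same grouping and counting would more likely be expressed as a join plus aggregation in Spark or Hive; the dictionary version only shows the shape of the report.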

7. The power of our platform
PayPal operates one of the largest PRIVATE CLOUDS in the world*
42 petabytes of data*
200+ markets
237M active customer accounts**
7.6 BILLION payments in 2017**
19M merchants
~600 payments/second at peak*
Dedicated to ... with a customer focused, strong performance, highly scalable, continuously available PLATFORM.
PayPal has one of the top five Kafka deployments in the world, handling over 200 billion messages per day
PayPal operates one of the largest Hadoop deployments in the world.
A 1600 Node Hadoop Cluster with 230TB of Memory, 78PB of Storage
Running 50,000 Jobs Per day