Scaling TB's of data with Apache Spark and Scala DSL at Production


Apache Spark is one of the top big-data processing platforms and has driven the adoption of Scala in many industry and academic settings. Because the Spark framework itself is written in Scala, it is a real pleasure to explore the beauty of the functional Scala DSL with Spark.

This talk intends to present:

Usage of the primary data structures (RDD, Dataset, DataFrame) in general large-scale data processing with HBase (data lake) and Hive (analytical engine).
Case study: we will go through the importance of physical data-splitting techniques such as coalesce, partition and repartition, and other important Spark internals, in scaling TBs of data (~17 billion records).
We will also look at a crucial and very interesting part of parallel and concurrent distributed data processing: tuning memory, cache, disk I/O, memory leaks, internal shuffle, Spark executors, the Spark driver, etc.
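
As a rough illustration of the topics listed above, the following minimal sketch views the same data as a DataFrame, a Dataset and an RDD, and compares repartition (a full shuffle) with coalesce (merging existing partitions without a full shuffle). It is not taken from the talk: the Event case class, the synthetic data, the partition counts and the configuration value are all assumptions chosen for a small local run, and real settings depend on data volume and cluster size.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical record type, used only for this illustration.
final case class Event(id: Long, bucket: Long)

object PartitioningSketch {

  /** Returns the partition counts: (initial, after repartition, after coalesce). */
  def partitionCounts(spark: SparkSession): (Int, Int, Int) = {
    import spark.implicits._

    // DataFrame: untyped rows with a schema. Synthetic data split into 8 partitions.
    val df = spark.range(0L, 1000000L, 1L, 8)
      .withColumn("bucket", col("id") % 100)

    // Dataset: the same data viewed through a typed case class.
    val ds: Dataset[Event] = df.as[Event]

    // RDD: the lower-level distributed collection underneath the Dataset.
    val initial = ds.rdd.getNumPartitions

    // repartition triggers a full shuffle and can raise or lower the partition count.
    val repartitioned = ds.repartition(16)

    // Cache a result that is reused, spilling to disk if it does not fit in memory.
    repartitioned.persist(StorageLevel.MEMORY_AND_DISK)
    val afterRepartition = repartitioned.rdd.getNumPartitions

    // coalesce merges existing partitions without a full shuffle; it can only lower the count.
    val afterCoalesce = repartitioned.coalesce(4).rdd.getNumPartitions

    // Release the cached blocks once they are no longer needed.
    repartitioned.unpersist()

    (initial, afterRepartition, afterCoalesce)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioning-sketch")
      .master("local[4]") // assumption: local run for demonstration only
      .config("spark.sql.shuffle.partitions", "64") // illustrative value, not a recommendation
      .getOrCreate()
    try println(partitionCounts(spark))
    finally spark.stop()
  }
}
```

A common pattern is to use repartition before a wide, expensive stage to spread the work evenly, and coalesce just before writing output to avoid producing many small files.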

Records

Quick Info

Conference

HKOSCon 2018

Event Type

Main Track

Venue

Conference Hall 4-5

Is Topic

Yes

Linux

Data Science

Distributed Computing

Parallel Computing

Apache Kafka

Sun, 06/17/2018, 10:50 - 11:20

Content

Language

English

Level

Intermediate

Target Audience

Developer

Audience Requirement

Targeted audience:

Anyone who understands basic functional programming with Scala or has an understanding of Java.
Anyone who understands concurrent programming or multithreading in Java / Scala.
Anyone interested in distributed data processing, with a keen interest in data scaling optimization.
Anyone who has worked with Big Data or Fast Data, or has a keen interest in them.

Speaker

Chetankumar Khatri

Chetan Khatri works as a Technical Lead at Accion Labs and has diverse experience in the fields of Data Science and Machine Learning. He is an open source contributor to Apache Spark, Apache HBase, the Apache Spark - HBase Connector, Elixir Lang and many other open source projects. He authored the curriculum for Artificial Intelligence, Data Science and Distributed Computing at KSKV Kachchh University, Government of Gujarat, India. He has also reviewed a couple of books for Packt Publication on Scala machine learning, TensorFlow deep learning, and machine learning for the web. He has delivered many talks, including at PyCon India 2016, PyKutch 2016 and FOSSASIA 2018:

Distributing Machine learning with Apache Spark - PyCon India 2016
Think Machine learning with Scikit-learn - PyKutch 2016

Open Source Contributor:

Apache Spark
Apache HBase
Apache MXNet
ParlAI
Spark HBase Connector