Introduction to Kafka
This is a quick cookbook to introduce Apache Kafka.
Apache Kafka is a distributed publish-subscribe messaging platform. Kafka can be scaled horizontally and offers high fault tolerance.
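As a quick taste, here is a minimal sketch of publishing and consuming a message with the kafka-python client; the broker address and the topic name "demo-topic" are assumptions for illustration:

```python
# Minimal publish/consume sketch with kafka-python.
# Assumes a broker at localhost:9092 and a topic named "demo-topic".
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("demo-topic", b"hello kafka")   # messages are plain bytes
producer.flush()                              # make sure the send completes

consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",             # read from the start of the topic
)
for message in consumer:
    print(message.value)                      # b'hello kafka'
    break
```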
CSV File to JSON File Real Time Streaming Example
In this post we will see how to build a simple application that streams data from CSV files to JSON files in real time.
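As a preview, here is a minimal sketch of such a file-to-file pipeline using Spark Structured Streaming; the input/output paths and the two-column schema are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("CsvToJsonStream").getOrCreate()

# Streaming file sources require an explicit schema; these columns are assumed.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
])

# Watch a directory for new CSV files and write each micro-batch out as JSON.
csv_stream = spark.readStream.schema(schema).csv("/tmp/input-csv")

query = (csv_stream.writeStream
         .format("json")
         .option("path", "/tmp/output-json")
         .option("checkpointLocation", "/tmp/checkpoint")
         .start())
query.awaitTermination()
```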
Socket Word Count demo for Spark Structured Streaming
Structured Streaming is a new way of looking at real-time streaming. In this post we will see how to build our very first Structured Streaming app to perform a word count over the network.
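The heart of the demo looks roughly like the sketch below, assuming a local text source such as `nc -lk 9999` feeding the socket:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("SocketWordCount").getOrCreate()

# Read lines from a local socket (e.g. fed by `nc -lk 9999`).
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the full updated counts table to the console after every batch.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```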
A brief introduction to Spark Structured Streaming
Structured Streaming is a new way of looking at real-time streaming. With its DataFrame and Dataset abstractions, Structured Streaming provides an alternative to the well-known Spark Streaming API. It is built on top of the Spark SQL engine. This post covers some of the main features of Structured Streaming.
Processing data from MongoDB in Python
This post will give an insight into processing data from MongoDB in Python.
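For a flavour of what the post covers, here is a minimal pymongo sketch; the connection URI, database, and collection names are assumptions:

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (the URI is an assumption).
client = MongoClient("mongodb://localhost:27017/")
db = client["test_db"]

# Insert a document, then query documents back with a filter.
db.users.insert_one({"name": "alice", "age": 30})
for doc in db.users.find({"age": {"$gt": 25}}):
    print(doc)
```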
Introduction to the mongo shell
This post will introduce the mongo shell and the basic query operations that can be performed in it, with examples.
Processing data from MongoDB in a distributed environment - Apache Spark
We will look into the basics of processing data from MongoDB using Apache Spark.
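As a rough sketch, reading a collection into a DataFrame with the MongoDB Spark Connector looks like the following; the 2.x/3.x-style configuration keys, URI, database, and collection names are all assumptions:

```python
from pyspark.sql import SparkSession

# Assumes the mongo-spark-connector (2.x/3.x) is on the classpath and a local
# MongoDB holds a test_db.users collection; URI and names are assumptions.
spark = (SparkSession.builder
         .appName("MongoSparkDemo")
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost:27017/test_db.users")
         .getOrCreate())

df = spark.read.format("mongo").load()   # load the collection as a DataFrame
df.printSchema()
df.filter(df["age"] > 25).show()         # filters can be pushed down to MongoDB
```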
This is a step-by-step guide to install MongoDB on Mac
This post is a step-by-step guide to installing MongoDB on a Mac.
Building a scalable Apache Spark setup with Docker
This post is a complete guide to building a scalable Apache Spark setup using Docker, and we will see how to enable the History Server for log persistence. The ability to scale up and down is one of the key requirements of today's distributed infrastructure. By the end of this guide, you should have a pretty fair understanding of setting up Apache Spark on Docker, and we will see how to run a sample program.
This post will guide you through a step-by-step setup to run PySpark jobs in PyCharm
This post walks through how to set up your local system to test PySpark jobs, followed by a demo running the same code using the spark-submit command.
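For reference, a minimal job like the sketch below (the file name and sample data are assumptions) can be run from PyCharm directly or submitted from a terminal with spark-submit:

```python
# word_count_job.py (hypothetical file name)
# Run from PyCharm directly, or from a terminal with:
#   spark-submit word_count_job.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PyCharmDemo").getOrCreate()
    df = spark.createDataFrame([("spark",), ("pyspark",), ("spark",)], ["word"])
    df.groupBy("word").count().show()
    spark.stop()
```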