Articles & Content

The Analytics Universe

Abhik Roy has 15 years of experience in the development and architecture of data solutions and platforms optimized for processing massive volumes of data. He specializes in the architecture and build of data products for data science and analytics teams, leveraging Big Data, distributed computing, and parallel computing / shared-nothing methodologies. His technology specializations include Mongo, Netezza, Hadoop, DB2, IBM BigInsights, and Apache Spark, among others.

How do you SHARE your vision of Netezza and R integration?
Are you in the boat where you are trying to build a truly scalable, self-service data science platform by integrating R and Netezza? If so, this SHARE strategy will help you plan and stay focused to make the initiative a success. S – Situation, H – Hindrance, A – Action, R – Result, E – Evaluate. S – Situation: The last decade saw rapid advancement in the development of cost-effective distributed computing and parallel processing technologies. This has enabl…
Apache Spark Streaming Deep Dive – Part 2
In the earlier blog, Apache Spark Streaming Deep Dive – Part 1, we saw the reference architecture of Spark Streaming and got an understanding of the building blocks needed to build streaming applications. In this blog, we will look at a simple code example that leverages the streaming architecture. Refer to the link below for Part 1 of this series. http://www.theanalyticsuniverse.com/apache-spark-streaming-deep-dive-part1 Problem Statement: We shall look at a simple code that will have a micro bat…
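The excerpt breaks off before the post's code, so what follows is not the author's example. It is a minimal sketch of a micro-batch streaming job in Scala, assuming the era-appropriate DStream API; the queue-backed test source and all names are illustrative:

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchDemo")
    // Micro-batches are formed every second
    val ssc = new StreamingContext(conf, Seconds(1))

    // A queue of RDDs stands in for a live source; each RDD becomes one micro-batch
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(rddQueue)

    // Each batch is processed with ordinary RDD-style transformations
    stream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

    ssc.start()
    for (_ <- 1 to 5) {
      rddQueue.synchronized { rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2) }
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
```

Each RDD pushed onto the queue is consumed as one micro-batch, which makes queueStream a convenient way to exercise streaming logic without a live source.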
Apache Spark Streaming Deep Dive – Part 1
In this and the next couple of blog posts, we will be looking at Apache Spark Streaming. We will begin with the reference architecture of Streaming and get an understanding of the building blocks needed to build streaming applications. Then we will dive deeper and look at some code that can accomplish stream analytics. What is Apache Spark Streaming? In simplest terms, Streaming lets you perform analytics on real-time data. By real-time data, I mean data that is arriving in st…
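To make "analytics on real-time data" concrete, here is the classic word count over a TCP socket, sketched in Scala; the host, port, and 5-second batch interval are assumptions, not details from the post:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Listen on a local TCP socket for lines of text arriving in real time
    val lines = ssc.socketTextStream("localhost", 9999)

    // Word count over each 5-second micro-batch
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With nc -lk 9999 running locally, every line typed at the terminal is counted within the next batch.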
Understanding Data Types and Data Structures in Scala and Apache Spark
In this blog post, I will provide a brief overview of the data structures and data types available in the Scala programming language. Machine learning algorithms using Apache Spark and the ML package use complex data structures like vectors and arrays, so developing an understanding of these becomes vital if we are interested in building machine learning algorithms in Apache Spark. Data Types in Scala: Just like any programming language, Scala supports Byte, Short, Int, Long, Float, …
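As a small illustration of the types the post enumerates, the sketch below pairs Scala's basic value types with the dense and sparse vectors from the spark.mllib linalg package that was current when these posts were written; the values are arbitrary:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object TypesAndVectors {
  def main(args: Array[String]): Unit = {
    // Basic Scala value types mentioned in the post
    val b: Byte = 1; val s: Short = 2; val i: Int = 3
    val l: Long = 4L; val f: Float = 5.0f; val d: Double = 6.0

    // Common Scala collections
    val arr: Array[Double] = Array(1.0, 0.0, 3.0)
    val lst: List[Int] = List(1, 2, 3)

    // Spark's ML vectors: dense stores every element, while sparse
    // stores (size, indices, values) for mostly-zero data
    val dense: Vector = Vectors.dense(arr)
    val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    println(dense)   // [1.0,0.0,3.0]
    println(sparse)  // (3,[0,2],[1.0,3.0])
  }
}
```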
Data Summit 2016 Conference Abstract – Using Apache Spark to build Scalable Machine Learning Applications
The 2016 Data Summit Conference in New York is so far a huge success, and it is a privilege for me to get an opportunity to present an abstract at the conference. I was also lucky enough to be at the round-table on Enterprise Data Lakes. I will soon be posting my notes on the round-table discussions on Enterprise Data Lakes. Meanwhile, I am posting my presentation abstract below. Feel free to download it and use the examples and code in the presentation. And if you like it and feel you want to…
Predictive Analytics – Naive Bayes using OpenR and IBM Netezza
Naive Bayes: In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. More details on Naive Bayes are available at https://en.wikipedia.org/wiki/Naive_Bayes_classifier. Problem Description: We will look at the Iris Plants dataset (from the UCI Repository); this is perhaps the best-known dataset in the pattern recognition literature. The dataset contains …
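The post itself trains the classifier with OpenR on Netezza, and that code is not in the excerpt. For readers following the Spark threads of this blog, here is a comparable sketch using Spark MLlib's Naive Bayes instead (a deliberately swapped-in technique, not the post's method; the file name and the numeric-label column layout are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object IrisNaiveBayes {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("IrisNaiveBayes"))

    // Assumed CSV layout: sepalLen,sepalWid,petalLen,petalWid,label (label encoded 0/1/2)
    val data = sc.textFile("iris.csv").map { line =>
      val p = line.split(",")
      LabeledPoint(p(4).toDouble, Vectors.dense(p.take(4).map(_.toDouble)))
    }

    // Hold out 30% of the rows for testing
    val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)
    val model = NaiveBayes.train(train, lambda = 1.0) // lambda = additive smoothing

    // Fraction of test rows classified correctly
    val accuracy = test.map(lp =>
      if (model.predict(lp.features) == lp.label) 1.0 else 0.0).mean()
    println(s"Test accuracy: $accuracy")
    sc.stop()
  }
}
```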
Hear me Speak at Data Summit 2016
http://www.dbta.com/DataSummit/Speakers/Abhik-Roy.aspx
Collaborative Filtering using Apache Spark
A collaborative filtering process can be implemented using Apache Spark. In simplest terms, collaborative filtering is a technique to recommend similar entities by studying the characteristics of each of the entities in the collection, for example, recommendations of similar cities or similar movies. In this blog post, I will present a very simplified Apache Spark and Python program to demonstrate a use case. Sample data set: Below is a small sample from the dataset, which is in a file name…
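The post's program is written in Python and its data file is cut off above. As one common way to express collaborative filtering on Spark, here is a sketch of MLlib's ALS matrix-factorization recommender in Scala; the file name, column layout, and hyperparameters are assumptions rather than the post's choices:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object MovieRecommender {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("MovieRecommender"))

    // Assumed input layout: userId,itemId,rating per line
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(",")
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    // Factorize the user-item matrix: rank = number of latent factors
    val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

    // Top 5 recommendations for user 1
    model.recommendProducts(1, 5).foreach(println)
    sc.stop()
  }
}
```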
Breadth First Search Algorithm (BFS) in Apache Spark
Breadth-first search (BFS) is an algorithm for traversing or searching tree or graph data structures. In this blog post, we will take a quick tour of the algorithm and see how it can be implemented in Apache Spark. Let's say we want to know the degree of connection between O and M. O is connected in the first degree to P. O is connected to V and C through P, so V and C are second-degree connections of O. O is connected to M through P and C, hence M is a third-degree connection of O. Refer to t…
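The excerpt ends before the implementation, so the sketch below is one common join-based way to run BFS on an RDD of edges, not necessarily the post's code. The edge list is reconstructed from the O/P/V/C/M example above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkBFS {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("SparkBFS"))

    // Undirected edges from the post's illustration (O-P, P-V, P-C, C-M)
    val edges = sc.parallelize(Seq(("O", "P"), ("P", "V"), ("P", "C"), ("C", "M")))
      .flatMap { case (a, b) => Seq((a, b), (b, a)) } // store both directions
      .cache()

    // The frontier starts at the source; distances holds the best-known degree
    var frontier = sc.parallelize(Seq(("O", 0)))
    var distances = frontier

    for (_ <- 1 to 3) { // expand up to three degrees of connection
      val next = frontier.join(edges)                       // (node, (dist, neighbor))
        .map { case (_, (dist, neighbor)) => (neighbor, dist + 1) }
      frontier = next.subtractByKey(distances)              // keep unseen nodes only
        .reduceByKey(math.min)
      distances = distances.union(frontier)
    }

    distances.collect().sorted.foreach(println) // M appears at degree 3
    sc.stop()
  }
}
```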
Understanding Complex Relationships using Apache Spark
Apache Spark can be used to process graph or relationship data. To illustrate a particular use case, consider grocery store sales data. We are interested in identifying the grocery item that is most popular. We define popularity by the number of times the item was purchased along with other items. Consider the illustration below. Here, we see cheese was bought along with milk, fish and carrot, but flour was bought only with cheese. Hence, based on our ground rules on popularity, we can say che…
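The post's code is not visible in the excerpt; the sketch below implements the stated popularity rule, counting how often each item is bought together with other items. The four sample baskets are reconstructed from the description, and the input format is an assumption:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoPurchasePopularity {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("CoPurchasePopularity"))

    // Assumed input: one basket per line, comma-separated items
    val baskets = sc.parallelize(Seq(
      "cheese,milk", "cheese,fish", "cheese,carrot", "cheese,flour"))

    // For each item, count how many times it appears alongside another item
    val popularity = baskets
      .map(_.split(",").toSeq)
      .flatMap(items => for (a <- items; b <- items if a != b) yield (a, 1))
      .reduceByKey(_ + _)
      .sortBy(-_._2) // most popular first

    popularity.collect().foreach(println) // cheese ranks highest
    sc.stop()
  }
}
```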