Analyzing Big Data with Microsoft R Server

Learn how to use Microsoft R Server (MRS) to analyze large datasets using R, one of the most powerful programming languages.

This course is currently only available for on-site client instruction.

If you are interested in training, please contact us at [email protected]

Course Overview

The open-source programming language R has long been popular (particularly in academia) for data processing and statistical analysis. Among R's strengths as a programming language are its succinctness and its extensive repository of third-party libraries for performing all kinds of analyses. Together, these two features make it possible for a data scientist to go very quickly from raw data to summaries, charts, and even full-blown reports. However, one deficiency of R is that it is memory-bound. In other words, R needs to load the data in its entirety into memory (like any other object). This is one of the reasons R has been received more reluctantly in industry, where data sizes are usually considerably larger than in academia.

The main component of Microsoft R Server (MRS) is the RevoScaleR package. RevoScaleR is an R library that offers a set of functions for processing large datasets without having to load the data all at once into memory. In addition, RevoScaleR offers a rich set of distributed statistical and machine learning algorithms, a set that grows over time. Finally, RevoScaleR offers a mechanism by which we can take code that we developed locally (such as on a laptop) and deploy it remotely (such as on SQL Server or a Spark cluster, where the underlying infrastructure is very different) with minimal effort. In this course, we will show you how to use MRS to run an analysis on a large dataset and provide examples of how to deploy it on a Spark cluster or in-database inside SQL Server. Upon completion, you will know how to use R to solve big-data problems.
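As a taste of how RevoScaleR avoids loading data into memory all at once, here is a minimal sketch: a large CSV is converted chunk by chunk into RevoScaleR's on-disk XDF format, and a summary is then computed by streaming over the XDF file. The file names are hypothetical placeholders, and the code assumes an environment where the RevoScaleR package is installed (Microsoft R Client or R Server).

```r
# Sketch only: assumes RevoScaleR is available (Microsoft R Client/Server).
library(RevoScaleR)

# Hypothetical input CSV and output XDF paths
taxi_csv <- RxTextData("yellow_tripdata.csv")
taxi_xdf <- RxXdfData("yellow_tripdata.xdf")

# rxImport reads the CSV in chunks and writes an XDF file to disk,
# so the full dataset never has to fit in memory at once
rxImport(inData = taxi_csv, outFile = taxi_xdf, overwrite = TRUE)

# rxSummary also processes the XDF file chunk by chunk
rxSummary(~ trip_distance + fare_amount, data = taxi_xdf)
```

The same `data =` argument later accepts other data sources (such as a SQL Server table), which is what makes the deploy-anywhere workflow described below possible.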

Additionally, throughout this course students will learn to think like a data scientist by learning about the steps involved in the data science cycle: getting raw data; examining it and preparing it for analysis and modeling; running various analyses and examining the results; and finally deploying a solution.

This course covers the following:

- Getting started: We give an overview of RevoScaleR and show you how to access it by downloading and installing the Microsoft R Client. We then get the NYC Taxi data used during the course. Finally, we install the required R packages we will be using throughout the course.
- Reading the data: We talk about two different ways that RevoScaleR can handle the data and the trade-offs involved.
- Preparing the data: We examine the data and ask how we can clean it and then make it richer and more useful to the analysis. In the process, we learn how to use RevoScaleR to perform data transformations and how third-party packages can be leveraged.
- Examining the data: We now examine the data visually and through various summaries to see what does and does not mesh with our understanding of it. We look at sampling as a way to examine outliers.
- Visualizing the data: We examine ways of visualizing our results and getting a feel for the data. In the process, we learn how RevoScaleR interacts with other visualization tools.
- Clustering example: We look at k-means clustering, our first RevoScaleR analytics function, and at how we can improve its performance when the data is large.
- Modeling example: We build a few predictive models and show how we can examine the predictions and compare the models. We see how our choice of model can have performance implications.
- Deploying and scaling: We talk about RevoScaleR's write-once-deploy-anywhere philosophy and what we mean by a compute context. We then put this into practice by deploying our code to SQL Server and Spark, and talk about the architectural differences.
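The write-once-deploy-anywhere idea can be sketched as follows: the same RevoScaleR modeling call runs locally, inside SQL Server, or on a Spark cluster, and only the compute context changes. This is a hedged sketch, not working deployment code; the connection string and formula are hypothetical placeholders, and it assumes a Microsoft R Server installation with the corresponding back ends configured.

```r
# Sketch only: assumes RevoScaleR and configured SQL Server / Spark back ends.
library(RevoScaleR)

model_formula <- fare_amount ~ trip_distance + passenger_count  # hypothetical

# 1. Local compute context (the default)
rxSetComputeContext(RxLocalSeq())
fit <- rxLinMod(model_formula, data = RxXdfData("yellow_tripdata.xdf"))

# 2. In-database in SQL Server: switch the compute context and point the
#    data source at a table; the rxLinMod call itself is unchanged
sql_cc <- RxInSqlServer(connectionString = "Driver=SQL Server;Server=...")
rxSetComputeContext(sql_cc)

# 3. On a Spark cluster: connect and switch context; again, the modeling
#    call stays the same and the computation is distributed
spark_cc <- rxSparkConnect()
rxSetComputeContext(spark_cc)
```

The design point is that analysis code and execution environment are decoupled: you develop against a small local sample, then redirect the compute context at deployment time.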

Your Instructor

Jose Marcial Portilla

Jose Marcial Portilla has a BS and MS in Mechanical Engineering from Santa Clara University and years of experience as a professional instructor and trainer for data science and programming. He has publications and patents in various fields such as microfluidics, materials science, and data science technologies. Over the course of his career he has developed a skill set in analyzing data, and he hopes to use his experience in teaching and data science to help other people learn the power of programming and the ability to analyze data, as well as to present that data in clear and beautiful visualizations. Currently he works as the Head of Data Science for Pierian Data Inc. and provides in-person data science and Python training courses to a variety of companies all over the world, including top banks such as Credit Suisse. Feel free to contact him on LinkedIn for more information on in-person training sessions.

Get started now!