
Machine Learning tutorial: How to create a decision tree in RapidMiner using the Titanic passenger data set

 

 

Greetings! And welcome to another wam bam, thank you ma'am, mind blowing, flex showing, machine learning tutorial here at refactorthis.net!

This tutorial is based on a machine learning toolkit called RapidMiner by Rapid-I.  RapidMiner is a full-featured, Java-based, open source machine learning toolkit with support for all of the popular machine learning algorithms used in data analytics today.  The library supports the following machine learning algorithms (to name a few):

  • k-NN
  • Naive Bayes (kernel)
  • Decision Tree (Weight-based, Multiway)
  • Decision Stump
  • Random Tree
  • Random Forest
  • Neural Networks
  • Perceptron
  • Linear Regression
  • Polynomial Regression
  • Vector Linear Regression
  • Gaussian Process
  • Support Vector Machine (Linear, Evolutionary, PSO)
  • Additive Regression
  • Relative Regression
  • k-Means (kernel, fast)
  • And much much more!!
Excited yet?  I thought so!

How to create a decision tree using RapidMiner

When I first ran across screenshots of RapidMiner online, I thought to myself, "Oh boy.. I wonder how much this is going to cost...".  The UI looked so amazing.  It's like Visual Studio for data mining and machine learning!  Much to my surprise, I found out that the application is open source and free!

Here is a quote from the RapidMiner site:

RapidMiner is unquestionably the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for the integration into own products. Thousands of applications of RapidMiner in more than 40 countries give their users a competitive edge.

I've been trying some machine learning "challenges" recently to sharpen my skills as a data scientist, and I decided to use RapidMiner to tackle the kaggle.com machine learning challenge called "Titanic: Machine Learning from Disaster".  The data set is a CSV file that contains information on many of the passengers of the infamous Titanic voyage.  The challenge gives you two CSV files: a training file that contains all of the attributes as well as the label Survived, and a testing file that contains only the attributes (no Survived label).  The goal is to predict the Survived label for the testing set based on what can be learned from the training set.
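To make the difference between the two files concrete, here is a minimal Python sketch.  It is not part of the RapidMiner workflow.  The column names are a subset of the ones in the Kaggle files, but the rows are made-up placeholders, not real passengers, and the load helper is a hypothetical name.

```python
import csv
import io

# Made-up rows in the shape of the two Kaggle files (only a few columns shown).
# The values are placeholders for illustration, not real passenger records.
TRAIN_CSV = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,30
2,1,1,female,40
"""
TEST_CSV = """PassengerId,Pclass,Sex,Age
3,2,female,25
"""


def load(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


train_rows = load(TRAIN_CSV)
test_rows = load(TEST_CSV)

# In the training data, the label is split off from the attributes.
labels = [row["Survived"] for row in train_rows]
attributes = [{k: v for k, v in row.items() if k != "Survived"} for row in train_rows]

# The only column the testing file lacks is the label we have to predict.
missing_in_test = set(train_rows[0]) - set(test_rows[0])
print("Column to predict:", missing_in_test)
```

The model only ever sees the label while learning; at prediction time it gets the attribute columns alone.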

Warning: Although I'm not going to provide the complete solution to this challenge, I warn you, if you are working on this challenge, then you should probably stop reading this tutorial.  I do provide some insights into the survival data found in the training data set.  It's best to try to work the challenge out on your own.  After all, we learn by TRYING, FAILING, TRYING AGAIN, THEN SUCCEEDING.  I'd also like to say that I'm going to do my very best to go easy on the THEORY of this post..  I know that some of my readers like to get straight to the action :)  You have been warned..

 

Why a decision tree?

A decision tree model is a great way to visualize a data set to determine which attributes of a data set influenced a particular classification (label).  A decision tree looks like a tree with branches, flipped upside down..  Perhaps a (cheesy) image will illustrate..

 

After you are finished laughing at my drawing, we may proceed.......  OK

In my example, imagine that we have a data set that relates lifestyle to heart disease.  Each row describes a person: their sex, age, Smoker (y/n), Diet (good/poor), and a label Risk (Less Risk/More Risk).  The data indicates that the biggest influence on Risk turns out to be the Smoker attribute, so Smoker becomes the first branch in our tree.  For smokers, the next most influential attribute happens to be Age; for non-smokers, however, the data indicates that their diet has a bigger influence on the risk.  The tree keeps branching into new nodes until a classification is reached or the maximum "depth" that we establish is reached.  So as you can see, a decision tree can be a great way to visualize how a decision is derived based on the attributes in your data.
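If you prefer code to drawings, the same example tree can be written down by hand.  This is only a sketch: the age_over_50 cutoff and the leaf labels are assumptions made for illustration, and classify is a hypothetical helper, not anything from RapidMiner.

```python
# A hand-written tree mirroring the heart disease example.
# The age cutoff and the labels at the leaves are assumptions for illustration.
TREE = {
    "attr": "smoker",
    "children": {
        "y": {"attr": "age_over_50",
              "children": {True: "More Risk", False: "Less Risk"}},
        "n": {"attr": "diet",
              "children": {"poor": "More Risk", "good": "Less Risk"}},
    },
}


def classify(tree, person):
    """Walk from the root to a leaf, recording which attribute was tested at each step."""
    path = []
    node = tree
    while isinstance(node, dict):          # inner nodes are dicts, leaves are label strings
        value = person[node["attr"]]
        path.append(f"{node['attr']}={value}")
        node = node["children"][value]
    return node, path


label, path = classify(TREE, {"smoker": "n", "age_over_50": True, "diet": "poor"})
print(" -> ".join(path), "=>", label)
```

Notice that the age of this non-smoker is never looked at.  Once the first branch is taken, only the attributes on that side of the tree matter.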

RapidMiner and data modeling

Ready to see how easy it is to create a prediction model using RapidMiner?  I thought so!

Create a new process

When you are working in RapidMiner, your project is known as a process.  So we will start by running RapidMiner and creating a new process.

 

 

The version of RapidMiner used in this tutorial is version 5.3.  Once the application is open, you will be presented with the following start screen.

 From this screen you will click on New Process

 You are presented with the main user interface for RapidMiner.  One of the most compelling aspects of RapidMiner is its ease of use and intuitive user interface.  The basic flow of this process is as follows:

  • Import your test and training data from CSV files into your RapidMiner repository.  This can be found in the repository menu under Import CSV file
  • Once your data has been imported into your repository, the datasets can be dragged onto your process surface for you to apply operators
  • You will add your training data to the process
  • Next, you will add your testing data to the process
  • Search the operators for Decision Tree and add the operator
  • In order to use your training data to generate a prediction on your testing data using the Decision Tree model, we will add an "Apply Model" operator to the process.  This operator has an input that you will associate with the output model of your Decision Tree operator.  There is also an input that takes "unlearned" data from the output of your testing dataset.
  • You will attach the outputs of Apply Model to the results connectors on the right side of the process surface.
  • Once you have designed your model, RapidMiner will show you any problems with your process and will offer "Quick fixes", if any exist, that you can double click to resolve.
  • Once all problems have been resolved, you can run your process and you will see the results that you wired up to the results side of the process surface.
  • Here are screenshots of the entire process for your review

 Empty Process

 

Add the training data from the repository by dragging and dropping the dataset that you imported from your CSV file

 

Repeat the process and add the testing data underneath the training data

Now you can search in the operators window for Decision Tree operator.  Add it to your process.

The way that you associate the inputs and outputs of operators and data sets is by clicking on the output of one item and connecting it by clicking on the input of another item.  Here we are connecting the output of the training dataset to the input of the Decision Tree operator.

 

Next we will add the Apply model operator

Then we will create the appropriate connections for the model

Observe the quick fixes in the problems window at the bottom.. you can double click the quick fixes to resolve the issues.

You will be prompted to make a simple decision regarding the problem that was detected.  Once you resolve one problem, other problems may appear.  Be sure to resolve all problems so that you can run your process.

Here is the process after resolving all problems.

 

Next, I select the decision tree operator and I adjust the following parameters:

Maximum Depth: change from 20 to 5.

Check both the "no pre pruning" and "no pruning" boxes to make sure that the tree is not "pruned".

Once this has been done, you can Run your process and observe the results.  Since we connected both the model as well as the labeled result to the output connectors of the process, we are presented with a visual display of our Decision Tree (model) as well as the Test data set with the prediction applied.
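For readers who want to see what the Decision Tree and Apply Model operators are doing conceptually, here is a rough Python sketch of the same learn-then-apply flow.  It is not RapidMiner's implementation: it uses a plain information gain split, the toy heart disease rows are made up, and the names learn, apply_model and max_depth are my own.

```python
from collections import Counter
from math import log2

LABEL = "risk"
ATTRS = ["smoker", "age", "diet"]

# Made-up training rows (attributes plus label); not real data.
TRAIN = [
    {"smoker": "y", "age": "old",   "diet": "good", "risk": "More"},
    {"smoker": "y", "age": "old",   "diet": "poor", "risk": "More"},
    {"smoker": "y", "age": "old",   "diet": "good", "risk": "More"},
    {"smoker": "y", "age": "young", "diet": "poor", "risk": "Less"},
    {"smoker": "n", "age": "old",   "diet": "good", "risk": "Less"},
    {"smoker": "n", "age": "young", "diet": "good", "risk": "Less"},
    {"smoker": "n", "age": "old",   "diet": "good", "risk": "Less"},
    {"smoker": "n", "age": "young", "diet": "poor", "risk": "More"},
]
# Unlabeled rows, playing the role of the testing data set.
TEST = [
    {"smoker": "y", "age": "old", "diet": "good"},
    {"smoker": "n", "age": "old", "diet": "poor"},
]


def entropy(rows):
    """Shannon entropy of the label distribution in rows."""
    counts = Counter(r[LABEL] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attr):
    """Information gain from splitting rows on attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder


def majority(rows):
    """Most common label in rows."""
    return Counter(r[LABEL] for r in rows).most_common(1)[0][0]


def learn(rows, attrs, max_depth):
    """Build a tree; stop when the rows are pure, attributes run out, or depth is used up."""
    if len({r[LABEL] for r in rows}) == 1 or not attrs or max_depth == 0:
        return majority(rows)
    best = max(attrs, key=lambda a: gain(rows, a))
    node = {"attr": best, "default": majority(rows), "children": {}}
    rest = [a for a in attrs if a != best]
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = learn(subset, rest, max_depth - 1)
    return node


def apply_model(tree, row):
    """Follow the branches for one unlabeled row until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


tree = learn(TRAIN, ATTRS, max_depth=5)
stump = learn(TRAIN, ATTRS, max_depth=1)
deep_predictions = [apply_model(tree, r) for r in TEST]
stump_predictions = [apply_model(stump, r) for r in TEST]
print("depth 5:", deep_predictions)
print("depth 1:", stump_predictions)
```

On this toy data the two trees disagree on the second test row: the depth 1 tree only ever looks at Smoker, while the deeper tree goes on to check Diet for non-smokers.  That is the trade-off you are adjusting when you change the maximum depth.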

(Decision Tree Model)

 

(The example test result set with the predictions applied)

 

As you can see, RapidMiner makes complex data analysis and machine learning tasks extremely easy with very little effort.

This concludes my tutorial on creating Decision Trees in RapidMiner.

Until next time,

 

Buddy James

 




About the author

My name is Buddy James.  I'm a Microsoft Certified Solutions Developer from the Nashville, TN area.  I'm a Software Engineer, an author, a blogger (http://www.refactorthis.net), a mentor, a thought leader, a technologist, a data scientist, and a husband.  I enjoy working with design patterns, data mining, c#, WPF, Silverlight, WinRT, XAML, ASP.NET, python, CouchDB, RavenDB, Hadoop, Android(MonoDroid), iOS (MonoTouch), and Machine Learning. I love technology and I love to develop software, collect data, analyze the data, and learn from the data.  When I'm not coding,  I'm determined to make a difference in the world by using data and machine learning techniques. (follow me at @budbjames).  


Complete coverage of your source code with NDepend part 1

What is NDepend?

This article is part one of a two part series about one of the most practical and dynamic tools in existence for .NET development.  I’m talking about NDepend http://www.NDepend.com.  I was approached about writing a review for NDepend so I downloaded the application to give it a try.  As with all of my reviews, let it be known that if I think a product is mediocre, then that’s what I’m going to write.  All that to say that this is no exaggeration: I really do feel this strongly about this tool.  I’m sure by the end of this article, I will have piqued your interest too.  If you are interested, please read on.

NDepend pro product suite

From NDepend.com, “NDepend is a Visual Studio tool to manage complex .NET code and achieve high Code Quality.”  This tool allows you to visualize your source code in many different ways in an effort to analyze the quality of your code and work out how to improve it.  The product comes complete with a Visual Studio add-in, an independent GUI tool, and a set of console-based power tools, which makes the product suite extremely versatile.  Whether you are pressed for time and need to analyze your code while in Visual Studio, you prefer a standalone GUI, or you are addicted to the command line, this product is made to fit your needs.

Installation

The NDepend installation process is very straightforward.  The download is a zip file that contains the complete product suite.  You simply pick a folder to install to and unzip the archive.  If you’ve purchased the pro version, you will be provided with a license in the form of an XML file which needs to be placed in the directory that you chose to install the product.

Installing the Visual Studio 2012 add-in

Once you’ve unzipped the archive, you need to run the NDepend.Install.VisualStudioAddin.exe executable to install the Visual Studio add-in.

Running the install

The installation completed

Adding an NDepend project to your solution

When you use the Visual Studio integration, you need to create an NDepend project in the solution that you wish to analyze.

NDepend will tell you anything that you wish to know about your source code.  This is powerful; however, there is a point that must be covered.  In order to be productive with NDepend, you must first define what information you wish to discover about your source code and how you plan to use that information.  If you don’t do this, then you will not get much use from the product.  The information that it provides to you is very useful, but you must take some time to plan out how you will use it to benefit you and your coding efforts.

You may wish to make sure that your code maintains a consistent amount of test coverage.  Perhaps you wish to make sure that all methods in your codebase stay below a certain threshold regarding the number of lines of code that they contain.  NDepend is capable of telling you this and much more about your source code.

One of the coolest features that I’ve seen in the product is Code Query LINQ (CQLinq).  This allows you to query your source code using LINQ syntax to bring back anything that you wish to know about your source code.  You can query an assembly, a class, even a method.  The product comes with predefined CQLinq rules but also allows you to create your own rules as well as edit existing rules.
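To give a feel for what "querying your source code" means, here is a small Python analogy.  To be clear, this is not NDepend and not CQLinq syntax.  It uses Python's standard ast module on a made-up snippet to find functions that are longer than a threshold, which is the kind of rule described above.  The functions_longer_than helper and the threshold of 4 lines are assumptions for illustration.

```python
import ast

# A made-up snippet to analyze; in practice this would be read from a source file.
SOURCE = '''
def short():
    return 1

def long_one():
    a = 1
    b = 2
    c = 3
    d = 4
    return a + b + c + d
'''


def functions_longer_than(source, max_lines):
    """Return (name, line_count) for every function spanning more than max_lines lines.

    Lines are counted from the def line through the last line of the body,
    which is a simpler measure than a real code metrics tool would use.
    """
    tree = ast.parse(source)
    return [
        (node.name, node.end_lineno - node.lineno + 1)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.end_lineno - node.lineno + 1 > max_lines
    ]


violations = functions_longer_than(SOURCE, max_lines=4)
print(violations)
```

The idea is the same as a code rule: the source code is treated as data, and a query picks out the parts that break a threshold you have chosen.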

I plan to write another blog post that explains my personal experience with the product.  I’ve recently joined an open source project that is a framework that handles some very advanced topics such as artificial intelligence, machine learning, and language design.  The project is called neural network designer http://bragisoft.com/.  I chose this project because the source code is vast and I believe that a large code base is a perfect target for getting the most benefit from NDepend.

I plan to use the product and test the following areas:

  •   What information do I want to know about my code base?
  •   When do I wish to be presented with this information?
  •   How do I plan on using this information to improve my code?
  •   How can I use NDepend to provide this information?

I think that if you wish to get any use out of the product, it will be very important that you answer these questions.  The product is vast and diverse but it can also be a bit intimidating.  With that said, I plan to use my next post to illustrate how I was able to use NDepend to define the metrics that I needed from my code, and how I used NDepend to provide those metrics to me.

Stay tuned for the next installment which will explain my experience with using NDepend to improve my development efforts and my source code.

Thanks for reading,

Buddy James




Comments (1) -

Jan Bogaerts
2/9/2013 10:29:56 AM #

Love the first part. Looking forward to the next.
