
RapidMiner tutorial: How to explore correlations in your data to discover the relevance of attributes

What is correlation?

From Wikipedia:

In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.

In layman's terms, correlation is a relationship between data attributes. For a quick refresher, in data mining, a dataset is made up of different attributes. We use these attributes to classify or predict a label. Some attributes have more "meaning" or influence over the label's value. As you can imagine, if you can determine the influence that specific attributes have over the label, you are in a better position to build a classification model, because you will know which attributes to focus on.
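To make the idea concrete, here is a minimal sketch of the most common correlation measure, the Pearson coefficient, written in plain Python. The attribute names and values are made up for illustration and are not taken from the Titanic dataset. Whether RapidMiner computes its coefficients in exactly this way is an assumption on my part.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy attributes and label (not real data).
hours_studied = [1, 2, 3, 4, 5]
hours_gaming = [9, 7, 6, 4, 1]
exam_score = [52, 55, 61, 70, 74]  # the "label" in this toy example

r_study = pearson(hours_studied, exam_score)
r_gaming = pearson(hours_gaming, exam_score)

print(f"hours_studied vs exam_score: {r_study:+.3f}")
print(f"hours_gaming  vs exam_score: {r_gaming:+.3f}")
```

Both toy attributes are strongly related to the label. One moves in the same direction as the label and the other moves in the opposite direction, which is why one coefficient is positive and the other is negative.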

In this example, I will use the kaggle.com Titanic data mining challenge dataset. This post will not uncover any information that is not readily available in the tutorial posted on kaggle.com.

Here are two screenshots.  The first screenshot will show you some statistics about the dataset.  The second screenshot will show a sample of the data.

Meta data view of the Titanic data mining challenge Training dataset

A data view of the dataset

The correlation matrix

Start by importing the Titanic training dataset into RapidMiner. You can use Read From CSV, Read From Excel, or Read from Database for this step. Next, search for the "Correlation Matrix" operator and drag it onto the process surface. Connect the output port of the Titanic training dataset to the example input port of the Correlation Matrix operator. Your process should look like this.

 

Now run the process and observe the output.

You are presented with several different result views. The first is the Correlation Matrix Attribute Weights view, which displays the "weight" of each attribute. This tutorial focuses on a different view, so click on the Correlation Matrix view. This matrix shows the correlation coefficients, which measure the strength of the relationship between each pair of attributes.

An easy way to get started with the correlation matrix is to notice that wherever an attribute intersects with itself, you have a dark blue cell with the value of 1, the strongest possible value. This is because any attribute matched with itself is a perfect correlation. A correlation coefficient can be positive or negative. A negative value does not mean there is less of a relationship; it means that one attribute tends to decrease as the other increases. The larger the absolute value of the coefficient, the stronger the relationship between those two attributes. If we look at the matrix and follow along the top row (survived), we will see which attributes have the strongest correlation with the label we are trying to predict.
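The same reading of the matrix can be sketched outside of RapidMiner. The snippet below builds a small correlation matrix in plain Python and ranks the attributes by the absolute value of their correlation with the label. The rows are hand-made and hypothetical (they are not the real Titanic data), the is_female column is an assumed numeric encoding, and the pearson helper is my own illustration rather than RapidMiner's implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, hand-made rows; NOT the real Titanic data.
data = {
    "survived":  [1, 1, 0, 0, 1, 0, 0, 1],
    "is_female": [1, 1, 0, 0, 1, 0, 1, 0],
    "pclass":    [1, 2, 3, 3, 1, 3, 2, 2],
    "age":       [29, 35, 22, 40, 19, 31, 27, 45],
}

names = list(data)

# Every attribute paired with every attribute, including itself.
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Read along the label's row and sort by strength, ignoring the sign.
label = "survived"
ranked = sorted(
    (name for name in names if name != label),
    key=lambda name: abs(matrix[label][name]),
    reverse=True,
)

for name in ranked:
    print(f"{name:10s} {matrix[label][name]:+.3f}")
```

In this toy data the strongest attribute has a negative coefficient, which illustrates why you compare magnitudes rather than signed values when judging relevance.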

Just as the kaggle.com tutorial specifies, the attributes with the strongest correlation with the label (survived) are

sex(0.295), pclass(0.115), and fare(0.66) 

Remember that the value as well as the color will help you to visually identify the stronger correlation between attributes.

If you are working with a classification problem, I'm sure you can see how valuable the correlation matrix can be in showing you the relationships between your label and attributes. Such insights can provide a great start on where to focus your attention when building your classification model.

Thanks for reading and keep your eyes open for my next tutorial! 




About the author

My name is Buddy James.  I'm a Microsoft Certified Solutions Developer from the Nashville, TN area.  I'm a Software Engineer, an author, a blogger (http://www.refactorthis.net), a mentor, a thought leader, a technologist, a data scientist, and a husband.  I enjoy working with design patterns, data mining, c#, WPF, Silverlight, WinRT, XAML, ASP.NET, python, CouchDB, RavenDB, Hadoop, Android(MonoDroid), iOS (MonoTouch), and Machine Learning. I love technology and I love to develop software, collect data, analyze the data, and learn from the data.  When I'm not coding,  I'm determined to make a difference in the world by using data and machine learning techniques. (follow me at @budbjames).  


Learning ILAsm, the backbone of .NET

Who cares about ILAsm or x86 assembly language anyway?

I'm sure a lot of you are wondering why anyone would care about learning ILAsm. It's not like you ever see it unless you disassemble an application. ILAsm, or MSIL, is the human-readable form of Microsoft .NET intermediate language. ILAsm is a lot like classic assembly language. It is a low-level programming language that allows you to write programs one instruction at a time with a very minimal syntax. I've explained the benefits of learning assembly language in my previous post, Why Learn Assembly Language. In a nutshell, if you learn .NET at the low level of IL, you will understand what makes any .NET language tick. You will have the knowledge to disassemble any .NET binary and debug your software at the instruction level.

This post is my first tutorial on writing code in ILAsm.  I hope you'll join me and become proficient in this language of kings.  You will have an edge over the competition and it will change the way you look at high level coding.  Enough talking, let us code!

How to compile ILasm

If you are a .NET developer, you most likely already have an ILAsm compiler on your computer, which is appropriately named ilasm.exe. Simply launch the Visual Studio Command Prompt, navigate to the folder where you keep your IL files, and run ilasm.exe against your source file.

ilasm.exe is fine if you like to write code in Notepad and drop out to a command prompt to compile it. I myself prefer an IDE. Visual Studio is an amazing IDE, but for some reason the people over at Microsoft didn't see fit to add support for ILAsm to the product. Never fear! There is a free, open source alternative that is nearly identical to Visual Studio, and it allows you to create and compile IL projects with syntax highlighting on Linux, Windows, and Mac OS X! What is this application, you ask?

MonoDevelop  

"Hello... World?"

I know, I know, it's a tired, worn out cliche but far be it from me to interrupt the order of the programming gods and illustrate a programming language without starting with the infamous "Hello World!" example.

//import the mscorlib assembly to give us access to Console and other basic classes
.assembly extern mscorlib{}

//define our assembly
.assembly HelloAssembly
{
    //define the version of this assembly
    .ver 1:0:0:0
}

//define the executable module
.module helloworld.exe

//define our main method
.method static void main() cil managed
{
    //set up the stack.  In ILAsm, all values are placed on the stack and then manipulated.
    //here we will allocate memory for one value to be on the stack at a given time
    .maxstack 1

    //define the main entry point to the application
    .entrypoint

    //load the infamous phrase onto the stack
    ldstr "Hello ILAsm!"
    
    //print the string from the stack to the console
    call void [mscorlib]System.Console::WriteLine(string)

    //return to end the program
    ret

}

If you have never seen asm or ILAsm, I can imagine how strange this code snippet may look. As I've stated before, ILAsm is a very cryptic, low-level language. Let's break down the application.

We start by importing the mscorlib library, which contains many of the base .NET classes. As the comments state, this library gives us access to the System.Console class.

Next, we define the assembly for our program.  All .NET executables are called assemblies.  Here we name our assembly as well as set a version number.  After we define our assembly, we define the executable module.  This is required in any ILasm application.  

Now it's time to define our main method, which loads and prints the string. We start by defining the maxstack, that is, the maximum number of values that can sit on the evaluation stack at any given time. In ILAsm, you push values onto the stack, then perform operations on them or use them as parameters to methods. Since we have a maxstack of one, we can have only one piece of data on the stack at any given time. We use ldstr to load a string onto the stack. If we were to load another string directly after the first ldstr, the stack would need to hold two values, which exceeds the declared maxstack of one. The first value is not silently pushed out; the method is simply no longer valid IL, and the runtime can refuse to run it.
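If the evaluation stack is new to you, the following toy model in Python may help. It is only a sketch under my own assumptions: the EvalStack class and run_hello function are hypothetical names, and the real runtime checks stack depth when it validates the method rather than raising an error at each push as this model does.

```python
class EvalStack:
    """Toy model of an evaluation stack with a declared depth limit."""

    def __init__(self, maxstack):
        self.maxstack = maxstack
        self.items = []

    def push(self, value):
        # Refuse to grow past the declared limit (a simplification of
        # how the real runtime treats .maxstack).
        if len(self.items) >= self.maxstack:
            raise OverflowError("stack depth would exceed maxstack")
        self.items.append(value)

    def pop(self):
        return self.items.pop()


def run_hello(stack, output):
    # Models: ldstr "Hello ILAsm!"
    stack.push("Hello ILAsm!")
    # Models: call WriteLine(string). The call consumes its argument,
    # so the string leaves the stack here.
    output.append(stack.pop())


output = []
stack = EvalStack(maxstack=1)
run_hello(stack, output)
depth_after_call = len(stack.items)

# Two loads in a row do not fit in a stack declared with maxstack 1.
overflowed = False
try:
    stack.push("first")
    stack.push("second")
except OverflowError:
    overflowed = True

print(output, depth_after_call, overflowed)
```

The point of the model is that a method call removes its arguments from the stack, and that the declared depth is a hard limit rather than a window that older values slide out of.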

Finally, we call the WriteLine method on the System.Console class, which takes the string currently on top of the stack as its argument and removes it from the stack.

 

So now you can load a string onto the stack and display it.  It's pretty interesting, although very limited as well.  How about we work with more than one value?  Let's try adding two numbers!

Sum it up

//reference to mscorlib
.assembly extern mscorlib {}

//define our assembly
.assembly MiniCalculator
{
    //the assembly version number
    .ver 1:0:0:0
}

//create the required module
.module MiniCalculator.exe

//define our main method
.method static void main() cil managed
{
    //we plan to work with two integers this time
    .maxstack 2

    //the main entry point to our application 
    .entrypoint
    
    //load a string of instructions on the stack
    ldstr "OK.  Class is in session.  Who can tell the class what is the sum of 2 + 2? That's right, the answer is  "
    
    //display the instructions to the user
    call void [mscorlib]System.Console::Write (string)
    
    //the call to Write consumed the string, so the stack is empty at this point.
    //put the number 2 on the stack.
    ldc.i4 2
    //load a second integer; now we have two instances of the number 2 on the stack
    ldc.i4 2
    //add pops the two numbers off the stack and pushes their sum back on
    add
    //call the Write overload that takes an int32; it pops the sum and prints it to the console
    call void [mscorlib]System.Console::Write (int32)
    //return to exit the application
    ret
}

We start the application as before by importing mscorlib, defining our assembly and module, and creating the method to perform our work. We then load a string on the stack and use Console's Write method to display it, which also removes the string from the stack. Next we load two integers onto the stack. We call add, which pops the two integers and pushes their sum onto the stack. Finally, we use Console's Write once again to display the answer.
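To see why a maxstack of two is enough for this program, you can trace the stack depth by hand. The Python sketch below does that bookkeeping for the MiniCalculator instructions. The PROGRAM table and the trace_depth function are my own hypothetical helpers; the pop and push counts listed for each instruction reflect how those instructions behave in this particular listing.

```python
# (instruction, values popped, values pushed) for the body of main().
PROGRAM = [
    ("ldstr", 0, 1),                       # push the string
    ("call Console::Write(string)", 1, 0), # consume the string
    ("ldc.i4 2", 0, 1),                    # push the first integer
    ("ldc.i4 2", 0, 1),                    # push the second integer
    ("add", 2, 1),                         # pop both, push the sum
    ("call Console::Write(int32)", 1, 0),  # consume the sum
    ("ret", 0, 0),                         # return with an empty stack
]


def trace_depth(program):
    """Return (instruction, depth after it runs) for each instruction."""
    depth = 0
    trace = []
    for name, pops, pushes in program:
        if pops > depth:
            raise ValueError(f"{name} needs {pops} values, stack has {depth}")
        depth = depth - pops + pushes
        trace.append((name, depth))
    return trace


trace = trace_depth(PROGRAM)
required_maxstack = max(depth for _, depth in trace)
final_depth = trace[-1][1]

for name, depth in trace:
    print(f"{name:30s} depth = {depth}")
print("required maxstack:", required_maxstack)
```

The depth peaks at two, right after the second ldc.i4, and drops back to zero before ret. The string never shares the stack with the integers because the first call to Write has already consumed it.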

This concludes part one of my ILasm tutorial series. Please check back soon for my next installment in which we will tackle data types, loops, and classes!

Until next time..

~/Buddy James
