
RapidMiner tutorial: How to explore correlations in your data to discover the relevance of attributes

What is correlation?

From Wikipedia:

In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.

In layman's terms, correlation is a relationship between data attributes.  For a quick refresher, in data mining, a dataset is made up of different attributes.  We use these attributes to classify or predict a label.  Some attributes have more "meaning" or influence over the label's value.  If you can determine how much influence specific attributes have over the label, you are in a better position to build a classification model, because you will know which attributes to focus on.
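To make the idea concrete, here is a minimal Python sketch of the Pearson correlation coefficient, the most common way to put a number on a linear relationship between two numeric attributes.  The function name and the toy values are my own assumptions for illustration; they are not taken from RapidMiner or from the Titanic dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two attributes and the spread of each one
    # (the shared 1/n factor cancels out, so it is omitted).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (spread_x * spread_y)

# Made-up toy values, not taken from the Titanic dataset.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 70, 74]
errors_made = [9, 8, 6, 3, 2]

print(pearson(hours_studied, exam_score))   # close to +1: the values rise together
print(pearson(hours_studied, errors_made))  # close to -1: one rises as the other falls
```

The result always falls between -1 and 1.  Values near either end indicate a strong linear relationship, and values near 0 indicate little or none.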

In this example, I will use the kaggle.com Titanic data mining challenge dataset.  This post will not uncover any information that is not readily available in the tutorial posted on kaggle.com.

Here are two screenshots.  The first screenshot will show you some statistics about the dataset.  The second screenshot will show a sample of the data.

Meta data view of the Titanic data mining challenge Training dataset

A data view of the dataset

The correlation matrix

First start by importing the Titanic training dataset into RapidMiner.  You can use Read From CSV, Read From Excel, or Read from Database to achieve this step.  Next, search for the "Correlation Matrix" operator and drag it onto the process surface.  Connect the Titanic training dataset output port to the Correlation Matrix operator's input example port.  Your process should look like this.

 

Now run the process and observe the output.

You are presented with several result views.  The first is the Correlation Matrix Attribute Weights view, which displays the "weight" of each attribute.  This tutorial focuses on a different view, so click on the Correlation Matrix view.  This matrix shows the correlation coefficients, which measure the strength of the relationship between each pair of attributes.

An easy way to get oriented is to notice that wherever an attribute intersects with itself, you get a dark blue cell with the value 1, the strongest possible value.  This is because any attribute is perfectly correlated with itself.  A correlation coefficient can be positive or negative, and a negative value does not mean a weaker relationship: the sign gives the direction of the relationship, while the magnitude gives its strength.  The larger the absolute value of the coefficient, the stronger the relationship between the two attributes.  If we follow along the top row (survived), we can see which attributes have the strongest correlation with the label we are trying to predict.
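For readers who want to see what such a matrix contains outside of RapidMiner, here is a rough Python sketch that builds one by hand.  It is not how the Correlation Matrix operator is implemented.  The rows are hypothetical values loosely shaped like the Titanic attributes, and the 0/1 encoding of sex is an assumption made so the attribute can be treated as a number.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric columns."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical rows for illustration only (sex encoded as 0 = male, 1 = female).
columns = {
    "survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "sex":      [0, 1, 1, 0, 1, 0, 1, 1],
    "pclass":   [3, 1, 2, 3, 1, 3, 3, 2],
    "fare":     [7.25, 71.28, 13.0, 8.05, 53.1, 8.46, 7.9, 26.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    # One printed line per attribute, one signed coefficient per column.
    print(name.ljust(9), "  ".join(f"{row[other]:+.2f}" for other in matrix))
```

In the printed output, the diagonal is 1 because each attribute is compared with itself, and the matrix is symmetric because the correlation of A with B equals that of B with A.  With these toy rows, pclass correlates negatively with survived, which illustrates a relationship that is strong even though its sign is negative.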

Just as the kaggle.com tutorial specifies, the attributes with the strongest correlation with the label (survived) are

sex(0.295), pclass(0.115), and fare(0.66) 

Remember that the value as well as the color will help you to visually identify the stronger correlation between attributes.
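Reading the label's row by eye works for a handful of attributes.  With more attributes, the same idea can be expressed as a sort on the absolute value of each coefficient.  The sketch below assumes you already have the label's row as a dictionary; the helper name and the coefficients are made up for illustration and are not the values reported above.

```python
def rank_by_strength(correlations):
    """Sort (attribute, coefficient) pairs by absolute coefficient, strongest first."""
    return sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical coefficients against a label, for illustration only.
label_row = {"age": -0.08, "sex": 0.54, "pclass": -0.34, "fare": 0.26}

for attribute, r in rank_by_strength(label_row):
    print(f"{attribute:<7}{r:+.2f}")
```

Sorting on the absolute value is what keeps a strongly negative attribute such as pclass in this toy row ahead of a weakly positive one.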

If you are working with a classification problem, I'm sure you can see how valuable the correlation matrix can be in showing you the relationships between your label and attributes.  Such insights can provide a great start on where to focus your attention when building your classification model.

Thanks for reading and keep your eyes open for my next tutorial! 




About the author

My name is Buddy James.  I'm a Microsoft Certified Solutions Developer from the Nashville, TN area.  I'm a Software Engineer, an author, a blogger (http://www.refactorthis.net), a mentor, a thought leader, a technologist, a data scientist, and a husband.  I enjoy working with design patterns, data mining, c#, WPF, Silverlight, WinRT, XAML, ASP.NET, python, CouchDB, RavenDB, Hadoop, Android(MonoDroid), iOS (MonoTouch), and Machine Learning. I love technology and I love to develop software, collect data, analyze the data, and learn from the data.  When I'm not coding,  I'm determined to make a difference in the world by using data and machine learning techniques. (follow me at @budbjames).  


Export Microsoft Office documents from ASP.NET applications using Infragistics NETADVANTAGE for ASP.NET


I have a lot of experience developing applications using the Infragistics NETADVANTAGE for ASP.NET controls.  I've recently downloaded the latest control suite and I've decided to write a series of articles on the different controls and how they are used.  This article focuses on creating Microsoft Word, Microsoft Excel, PDF, and XPS documents using the Infragistics NETADVANTAGE for ASP.NET controls.

Exporting PDF and XPS documents from the contents of the WebDataGrid

The Infragistics control suite is complete with two fully functional web grid controls.  The WebDataGrid provides a high performance, scalable ASP.NET AJAX enabled grid with built in support for sorting, filtering, and editing tabular data.  The control is designed with touch enabled devices in mind.  There is also built in support for flicking and other multi-touch gestures.  

Here is a screenshot of the WebDataGrid for your review.

 

As you can see, the grid is sleek, stylish, and very pleasing to the eye.  Infragistics controls have many predefined styles, as well as rich server-side and client-side APIs.

The second Infragistics grid control is the WebHierarchicalDataGrid.  The WebHierarchicalDataGrid shares the same functionality as the WebDataGrid as well as the ability to model master-detail and self referencing data relationships.  These relationships are represented by expandable rows that contain the related data inside of a parent row.  

Here is a screen shot of the WebHierarchicalDataGrid.

 

 

Both grids can export the contents of the grid's data source to Microsoft Excel, Microsoft Word, PDF, and XPS documents.  There's also built-in support for importing the contents of an Excel spreadsheet to populate the data grids.

Microsoft Office independence

One of the greatest features of the Microsoft document export functionality is that there is no need to have Microsoft Office installed on the server to generate the resulting documents.  The Infragistics library uses 100% managed .NET assemblies to implement this functionality.  This means there's no need to hack around the Word or Excel COM interop libraries to achieve the desired results.  Infragistics NETADVANTAGE comes with a Word Document object model as well as an Excel Workbook object model, which provide rich APIs for creating Microsoft Office documents for use in your applications.  You can generate invoices, work orders, and receipts with very little code.

My next article will include a fully functional sample that illustrates some of the functions of the WebDataGrid as well as the document export functionality.

This concludes the article.  Thanks for reading!

You can download a trial of the entire .NET NETADVANTAGE control suite by visiting the following URL:

 http://www.infragistics.com/products/dotnet/

Infragistics website

http://www.infragistics.com/

And here are some useful videos to get you started

Export Grid Data to Excel

Export Grid Data to PDF and XPS formats

 
