How to Use the WEKA Tool for Clustering
We can right-click the result set in the "Result list" panel and view the clustering results in a separate window. This process and the resulting window are shown in Figures 36 and 37. The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to each cluster.

Cluster centroids are the mean vectors of each cluster, so each dimension value in a centroid is the mean value of that dimension over the instances in the cluster.
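This property can be verified outside WEKA. The sketch below uses Python with scikit-learn and numpy (an assumption; the tutorial itself uses the WEKA GUI) to show that each k-means centroid is exactly the per-dimension mean of the points assigned to it:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two visually separate groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [7.8, 8.3]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each centroid is the per-dimension mean of the points in its cluster.
for k in range(2):
    members = X[km.labels_ == k]
    assert np.allclose(km.cluster_centers_[k], members.mean(axis=0))
    print(f"cluster {k} centroid: {km.cluster_centers_[k]}")
```

WEKA's SimpleKMeans prints the same kind of centroid table in its output window.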

Thus, centroids can be used to characterize the clusters. For example, the centroid for cluster 1 shows that this is a segment of cases representing approximately middle-aged to young customers. Another way of understanding the characteristics of each cluster is through visualization. We can do this by right-clicking the result set in the "Result list" panel and selecting "Visualize cluster assignments".

This pops up the visualization window shown in Figure 38. You can choose the cluster number and any of the other attributes for each of the three dimensions available: x-axis, y-axis, and color.

A cross represents a correctly classified instance, while squares represent incorrectly classified instances. At the lower left corner of the plot you see a cross that indicates: if outlook is sunny, then play the game. So this is a correctly classified instance.

To separate overlapping instances, you can introduce some jitter by sliding the jitter slide bar. The current plot is outlook versus play, as indicated by the two drop-down list boxes at the top of the screen. The same selection can be made using the horizontal strips on the right-hand side of the plot.
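Jitter is nothing more than small random noise added to the plotted coordinates, so that instances with identical attribute values (common with nominal attributes like outlook) no longer sit exactly on top of one another. A minimal numpy sketch of the idea (the attribute values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Nominal attributes plot at integer positions, so equal values
# overlap exactly; jitter spreads them slightly so all are visible.
x = np.array([0, 0, 0, 1, 1, 2])  # e.g. outlook = sunny/overcast/rainy
jitter = 0.1
x_plot = x + rng.uniform(-jitter, jitter, size=x.shape)

print(x_plot.round(3))
```

Increasing the jitter slider in WEKA simply widens this noise range.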

Each strip represents an attribute. A left click on a strip sets that attribute on the X-axis, while a right click sets it on the Y-axis. Several other plots are provided for deeper analysis; use them judiciously to fine-tune your model. Explaining the analysis in these charts is beyond the scope of this tutorial. The reader is encouraged to brush up on the analysis of machine learning algorithms.

In the next chapter, we will learn the next set of machine learning algorithms, that is clustering. A clustering algorithm finds groups of similar instances in the entire dataset. You should understand these algorithms completely to fully exploit the WEKA capabilities. As in the case of classification, WEKA allows you to visualize the detected clusters graphically.

To demonstrate clustering, we will use the provided iris database. The dataset contains three classes of 50 instances each, where each class refers to a type of iris plant. You can observe that there are 150 instances and 5 attributes. The attributes are named sepallength, sepalwidth, petallength, petalwidth, and class. The first four attributes are of numeric type, while class is a nominal type with 3 distinct values.

Examine each attribute to understand the features of the database. We will not do any preprocessing on this data and will proceed straight to model building. Click on the Cluster tab to apply the clustering algorithms to our loaded data, then click on the Choose button.

Now, select EM as the clustering algorithm and click on the Start button to process the data. After a while, the results are presented on the screen. Cluster 0 represents setosa, Cluster 1 represents virginica, and Cluster 2 represents versicolor, while the last two clusters do not have any class associated with them. If you scroll up the output window, you will also see statistics giving the mean and standard deviation of each attribute in the various detected clusters.
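The same experiment can be reproduced outside the GUI. The sketch below uses scikit-learn's GaussianMixture as a stand-in for WEKA's EM clusterer (an assumption; WEKA's EM can also choose the number of clusters itself, whereas here it is fixed to 3) and prints per-cluster means and standard deviations like those in WEKA's output window:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

iris = load_iris()
X = iris.data  # sepallength, sepalwidth, petallength, petalwidth

# EM clustering with 3 Gaussian components on the iris attributes.
gm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gm.predict(X)

# Per-cluster size, mean, and standard deviation of each attribute,
# analogous to the statistics WEKA prints for each detected cluster.
for k in range(3):
    members = X[labels == k]
    print(f"cluster {k}: n={len(members)}, "
          f"mean={members.mean(axis=0).round(2)}, "
          f"std={members.std(axis=0).round(2)}")
```

Which cluster number maps to which iris species is arbitrary; only the grouping matters.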

To visualize the clusters, right click on the EM result in the Result list. As in the case of classification, you will notice the distinction between the correctly and incorrectly identified instances. You can play around by changing the X and Y axes to analyze the results. You may use jittering as in the case of classification to find out the concentration of correctly identified instances.

The operations in the visualization plot are similar to those you studied in the case of classification. To demonstrate the power of WEKA, let us now look into an application of another clustering algorithm.

Set the Cluster mode to Classes to clusters evaluation, and click on the Start button. Notice that the Result list now contains two results: the first is the EM result and the second is the current Hierarchical clustering result.
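"Classes to clusters evaluation" assigns each cluster its majority class and counts the instances that disagree with that assignment. The sketch below reproduces the idea in Python with scikit-learn's AgglomerativeClustering (an assumption standing in for WEKA's HierarchicalClusterer; the linkage choice differs between the two tools):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Hierarchical (agglomerative) clustering into 3 clusters.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Classes-to-clusters evaluation: map each cluster to its majority
# class, then count the instances that disagree with that mapping.
errors = 0
for k in np.unique(labels):
    classes_in_k = y[labels == k]
    majority = np.bincount(classes_in_k).argmax()
    errors += int((classes_in_k != majority).sum())

print(f"incorrectly clustered instances: {errors} / {len(y)} "
      f"({100 * errors / len(y):.1f} %)")
```

WEKA reports this same count as "Incorrectly clustered instances" at the end of the cluster output.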

Likewise, you can apply multiple ML algorithms to the same dataset and quickly compare their results. A classic example of association mining: it was observed that people who buy beer often buy diapers at the same time; that is, there is an association between buying beer and diapers. Though this may not seem convincing at first, this association rule was mined from huge supermarket databases. Similarly, an association may be found between peanut butter and bread.

Finding such associations is vital for supermarkets: they can stock diapers next to the beer so that customers can locate both items easily, resulting in increased sales for the supermarket.
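The two numbers behind such a rule are support (how often the items occur together) and confidence (how often the consequent appears given the antecedent). A minimal pure-Python sketch over a small hypothetical transaction list (illustration only, not real supermarket data):

```python
# Support and confidence for the rule {beer} -> {diapers},
# computed over a hypothetical list of market-basket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"bread", "peanut butter"},
    {"beer", "chips"},
    {"diapers", "bread"},
]

n = len(transactions)
support_beer = sum("beer" in t for t in transactions) / n
support_both = sum({"beer", "diapers"} <= t for t in transactions) / n
confidence = support_both / support_beer  # P(diapers | beer)

print(f"support(beer & diapers) = {support_both:.2f}")
print(f"confidence(beer -> diapers) = {confidence:.2f}")
```

Apriori scales this computation to huge databases by pruning itemsets whose subsets are already infrequent.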

The Apriori algorithm is one such ML algorithm that finds probable associations and creates association rules. The C4.5 decision tree learner is implemented in WEKA as J48. Classifiers, like filters, are organized in a hierarchy: J48 has the full name weka.classifiers.trees.J48. The classifier is shown in the text box next to the Choose button; it reads J48 -C 0.25 -M 2. Clustering methods are used to identify groups of similar objects in multivariate datasets collected from fields such as marketing, biomedicine, and geospatial analysis.

There are different types of clustering methods, including partitioning methods, hierarchical clustering, and fuzzy clustering. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
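The three families can be contrasted on the same data. The sketch below uses scikit-learn (an assumption; WEKA offers analogous clusterers in its Cluster tab), with a Gaussian mixture's soft memberships standing in for a fuzzy method:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)

# Partitioning method: k-means splits the data into k disjoint groups.
hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical method: agglomerative clustering merges points bottom-up.
tree = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Soft ("fuzzy"-style) memberships: a Gaussian mixture gives each
# instance a degree of membership in every cluster, summing to 1.
probs = GaussianMixture(n_components=3, random_state=0).fit(X).predict_proba(X)

print(hard[:5], tree[:5])
print(probs[0].round(3))
```

Partitioning and hierarchical methods give each instance exactly one label, while the fuzzy-style output assigns graded memberships.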

Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing. Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.

Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. In Data Science, we can use clustering analysis to gain valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm. A good clustering method will produce high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
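One standard way to quantify "high intra-cluster similarity, low inter-cluster similarity" is the silhouette score, which contrasts each point's average distance to its own cluster with its distance to the nearest other cluster. A sketch with scikit-learn (an assumption; WEKA does not report this metric directly):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# Silhouette values near 1 indicate tight, well-separated clusters;
# values near 0 indicate overlapping clusters.
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")
```

Comparing the score across candidate values of k is a common way to choose the number of clusters.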


