IT 632 Auburn University Main

This week our topic shifts to the classification concepts in chapter four.  Therefore, answer the following questions:

  1. What are the various types of classifiers?
  2. What is a rule-based classifier?
  3. What is the difference between nearest neighbor and naïve bayes classifiers?
  4. What is logistic regression?

Reply to at least two classmates’ responses by the date indicated. You should be actively engaging with weekly discussions by providing peer-to-peer feedback.

POST 1

Various Types of Classifiers

According to Tan et al. (2018), there are ten types of classifiers grouped into five contrasting pairs: binary and multiclass, deterministic and probabilistic, linear and nonlinear, global and local, and generative and discriminative.

Rule-based Classifier

A rule-based classifier assigns a class label to a record when the record satisfies a certain condition. In programming languages, such rules are often written as if-then statements (Tan et al., 2018). A couple of examples include:

If “Warm blooded” = yes and “flies” = no -> mammal

If “Warm blooded” = yes and “flies” = yes -> bird

As the example shows, one can make a determination based upon certain criteria or rules.
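The two rules above can be sketched in Python as plain if-then statements. This is only an illustration: the function name `classify` and the dictionary keys `warm_blooded` and `flies` are assumptions made for the example, not names from Tan et al. (2018).

```python
def classify(animal):
    """Apply the two if-then rules in order; return None if no rule fires."""
    warm_blooded = animal.get("warm_blooded")
    flies = animal.get("flies")
    if warm_blooded == "yes" and flies == "no":
        return "mammal"
    if warm_blooded == "yes" and flies == "yes":
        return "bird"
    return None  # no rule covers this record


print(classify({"warm_blooded": "yes", "flies": "no"}))
print(classify({"warm_blooded": "yes", "flies": "yes"}))
```

A record that matches neither rule is left unlabeled here; a fuller rule set would normally add a default class.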

Difference Between Nearest Neighbor and Naïve Bayes Classifiers

Nearest neighbor and naïve Bayes are two important classifiers that both relate a new instance to the training data. Still, there are important differences between the two. A nearest neighbor classifier labels a new instance according to how closely its attributes resemble those of known instances (Puchkin & Spokoiny, 2020). For example, if it oinks like a pig and looks like a pig, it is probably a pig: the conclusion rests on similarity to known examples. Naïve Bayes, on the other hand, uses basic probability theory to compute the conditional probability of each class given the observed attributes (Tan et al., 2018). As an illustration of conditional probability, if there is an 80 percent chance of one event occurring and a 75 percent chance of a second event occurring given the first, then the chance of both occurring is 0.80 × 0.75 = 60 percent.
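To make the nearest neighbor idea concrete, here is a minimal sketch of a k-nearest-neighbor vote in plain Python. The toy data, the feature meanings (weight and snout length), and the function name `knn_predict` are all invented for illustration; Euclidean distance is assumed as the similarity measure.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # train is a list of (feature_vector, label) pairs
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (weight_kg, snout_length_cm) -> label
train = [
    ((90.0, 12.0), "pig"),
    ((110.0, 14.0), "pig"),
    ((100.0, 13.0), "pig"),
    ((30.0, 20.0), "dog"),
    ((25.0, 18.0), "dog"),
]
print(knn_predict(train, (95.0, 12.5)))
```

Note that nothing is learned ahead of time: the training points themselves are the model, and all the work happens at prediction time.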

Logistic Regression

Logistic regression is used to estimate the odds that a data instance belongs to a given class based on the attributes it contains (Song et al., 2021). Tan et al. (2018) list several characteristics of logistic regression: it is a discriminative model, it learns a different weight for each attribute, it does not involve computing densities or distances, it can handle irrelevant attributes, and it cannot handle data with missing values.
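As a rough sketch of how the weights turn attributes into a probability and odds, the snippet below applies the standard logistic (sigmoid) function to a weighted sum. The weights, bias, and feature values are made-up numbers, and the function names are assumptions for this example; in practice the weights would be learned from training data.

```python
import math


def predict_probability(weights, bias, features):
    """Logistic model: P(y=1 | x) = 1 / (1 + exp(-(w.x + b)))."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds of the positive class: p / (1 - p)."""
    return p / (1.0 - p)


# Hypothetical weights, one per attribute; a zero weight ignores an attribute.
weights = [1.5, -2.0, 0.0]
bias = 0.5
p = predict_probability(weights, bias, [1.0, 0.5, 7.0])
print(p, odds(p))
```

The third attribute has weight zero, which is one way an irrelevant attribute ends up having no effect on the prediction.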

POST 2

1. The different types of classifiers are:

  • Perceptron
  • Naive Bayes
  • Decision Tree
  • Logistic Regression
  • K-Nearest Neighbor
  • Artificial Neural Networks/Deep Learning
  • Support Vector Machine

2. Rule-based classifiers are static and don’t change based on new inputs or conditions. A frequent way of building a rule-based classifier is to first construct a decision tree and then post-process it. A rule-based classifier is a sequence of logical predicates that are evaluated in order (e.g., if X is true and Y or Z is false, it’s a rabbit).

3. Naive Bayes assumes that each class follows a simple distribution in which the features are independent of one another. In the continuous case, it fits a Normal distribution to each class, one feature at a time, and then makes a decision. Nearest neighbor, on the other hand, is not a probabilistic model. Its “smooth” version simply returns the ratio of each class among the nearest neighbors. This assumes nothing about the data distribution (besides it being locally smooth).
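The continuous case of Naive Bayes described above can be sketched as follows. This is a simplified illustration under stated assumptions: each feature within a class is modeled by its own Normal distribution, a small constant is added to the variance to avoid division by zero, and the names `fit_gaussian_nb` and `predict` are invented for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus a per-feature mean and variance for every class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # Small constant keeps the variance positive (an assumption of this sketch).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*members), means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def log_score(params):
        prior, means, variances = params
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_score(model[label]))


rows = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
labels = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(rows, labels)
print(predict(model, (1.1, 2.1)))
```

Summing the per-feature log likelihoods is where the "naive" independence assumption shows up in the code.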

4. Logistic regression is a statistical technique used to predict the probability of a binary response based on one or more independent variables. That is, given certain factors, logistic regression predicts an outcome that takes one of two values, such as 0 or 1, pass or fail, or yes or no.

IT 632 Auburn University Main

In chapter 8 we focus on cluster analysis.  Therefore, after reading the chapter answer the following questions:

  1. What are the characteristics of data?
  2. Compare the difference in each of the following clustering types: prototype-based, density-based, graph-based.
  3. What is a scalable clustering algorithm?
  4. How do you choose the right algorithm?

Reply to at least two classmates’ responses by the end of the week. 

post from jazmine

What are the characteristics of data?

There are several characteristics of data that can influence cluster analyses. For example, high dimensionality reduces the density of clusters (Tan et al., 2019). Large data sets also may not work well with clustering algorithms that are not scalable. Sparseness, noise in the data, and outliers also impact cluster analyses (2019). Finally, the scale of each variable (particularly if the scales vary), the properties of the data space, and the types of variables (e.g., nominal, ordinal, discrete, continuous) are listed by Tan et al. (2019) as impacting cluster analyses.

Compare the difference in each of the following clustering types: prototype-based, density-based, graph-based.

Prototype-based clustering builds clusters based on each data point’s proximity to the prototype that defines the cluster (Tan et al., 2019). Graph-based clusters are composed of interconnected objects; they are similar to prototype-based clusters in that they tend to be globular. Graph-based clustering tends to do well with irregular clusters (2019). Density-based clustering separates densely populated areas, which form the clusters, from sparsely populated areas (Sehgal & Garg, 2014). Like graph-based clustering, density-based clustering works well with irregular and intertwined clusters. Unlike graph-based clustering, density-based clustering handles noise and outliers well.
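As one concrete example of the prototype-based idea, here is a minimal k-means style sketch in which each cluster's prototype is its centroid. The data points and starting centroids are made up, the fixed iteration count is a simplification, and the function name `kmeans` is chosen just for this illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd-style k-means: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Because every point is pulled toward the nearest centroid, clusters found this way tend to be globular, which matches the comparison above.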

What is a scalable clustering algorithm?

A scalable clustering algorithm is one that continues to work well as the data set grows, because it uses a reasonable amount of memory and takes an affordable amount of time. Many clustering algorithms only work well on small and medium-sized data sets (Tan et al., 2019). Tan et al. (2019) mention two scalable clustering algorithms, CURE and BIRCH. These algorithms may use techniques that reduce computational and memory requirements, such as sampling the data or first partitioning the data into disjoint sets. Other techniques include parallel and distributed computation, and summarization, which takes one pass over the data and then clusters based on the summaries (2019).
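The summarization technique can be illustrated with a small sketch in the spirit of BIRCH's clustering feature, which keeps only a count, a linear sum, and a sum of squares for a group of points. This is not the full BIRCH algorithm (there is no tree and no threshold), and the function names are assumptions for this example.

```python
import math


def summarize(points):
    """One pass over the data: count, per-dimension linear sum, sum of squared norms."""
    n = 0
    linear_sum = None
    square_sum = 0.0
    for p in points:
        if linear_sum is None:
            linear_sum = [0.0] * len(p)
        n += 1
        for i, value in enumerate(p):
            linear_sum[i] += value
            square_sum += value * value
    return n, linear_sum, square_sum


def centroid_and_radius(n, linear_sum, square_sum):
    """Recover the centroid and RMS distance to it from the summary alone."""
    centroid = [s / n for s in linear_sum]
    mean_sq_norm = square_sum / n
    radius_sq = mean_sq_norm - sum(c * c for c in centroid)
    return centroid, math.sqrt(max(radius_sq, 0.0))


n, ls, ss = summarize(iter([(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0)]))
print(centroid_and_radius(n, ls, ss))
```

The data is passed as an iterator to stress that the points are seen only once; memory use depends on the number of dimensions, not the number of points.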

How do you choose the right algorithm?

In order to choose the right algorithm, you have to determine the best clustering technique based on the shapes of the clusters, the distribution of the data, the densities of the clusters, and whether the clusters are well-separated (Tan et al., 2019). It is also important to determine if there is a relationship between the clusters and if the clusters only exist in subspaces (2019). Different clustering algorithms are suited to different data properties. For example, if a dataset has a lot of noise or many outliers, an algorithm that is robust to these characteristics should be chosen. The number of data points, number of attributes, and characteristics of the data should also be considered (2019).

post from santhosh:

  1. What are the characteristics of data?

Data is any form of information that has been acquired and organized sensibly. Data, thus, are known facts, each of which carries an implied meaning.

The definition and characteristics of data are a very important database topic, and you should have at least a basic understanding of them.

There are five qualitative data characteristics.

When you have data of varying quality, do not assume that all of it is of value. Data must be of high quality to yield an optimal return, and for that it must have specific traits.

Data should be exact, providing facts that are accurate and reliable. Precision saves time and money. It is your due diligence to make sure the data you are using is credible, dependable, and consistent. Falsified data is worse than having no data at all.

  2. Compare the difference in each of the following clustering types: prototype-based, density-based, graph-based.

Clustering groups data objects based on the information found in the data that describes the objects and their relationships. The goal of clustering is to create groups such that the objects within a group are like one another and different from the objects in other groups. The greater the similarity within a group and the greater the difference between groups, the better the clustering quality.

For data to be of high quality and be valuable, it must be relevant. However, in today’s dynamic, data-filled world, not all necessary information is constantly updated. Data that is tailor-made to the individual user’s requirements is exemplary. Additionally, the data should be processable, making it convenient to apply.

There are significant differences among prototype-based, density-based, and graph-based clustering. Each organizes data objects according to their descriptions and the relationships among them. The purpose of grouping data objects is to create sets that are distinct from one another yet internally similar; the aim is high similarity within a group and significant differences between groups (Mingxiao et al., 2017).

  3. What is a scalable clustering algorithm?

To cluster data at scale, we must do the following: collect a sample of the data, cluster the sample, and then fine-tune the clusters. In the first step, a uniform selection from the original data results in a highly representative subset. For simplicity, the discussion is limited to several popular parametric techniques. The sampling step is followed by clustering and refinement algorithms. As a generalization of the stable marriage problem, the clustering step can be described as a stable marriage solution, whereas refining with constraints is an iterative relocation strategy. The overall approach has a complexity of O(kN log N), which is similar to that of other balanced clustering techniques. Compared with the unconstrained clustering method, the proposed framework performs about the same or better. The framework has been tested on many datasets, including those with high-dimensional features (i.e., more than 20,000).

  4. How do you choose the right algorithm?

To understand a problem well, a lot of data is required. Nonetheless, data availability is frequently a challenge. When the training set is small, or the dataset has fewer observations than features (as with genetics or text data), use algorithms with high bias and low variance, such as linear regression, Naive Bayes, or linear SVM.

Low bias/high variance methods like KNN, Decision trees, or kernel SVM can be used when the training dataset contains many observations, and the observation count exceeds the feature count.
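The rule of thumb about observations versus features can be written down as a small helper. This is only a sketch: the function name `suggest_model_family` is invented, and the simple comparison of row count to feature count is an illustrative threshold, not a rule from any textbook. Simple linear models are conventionally described as high bias/low variance, and flexible models such as KNN as low bias/high variance.

```python
def suggest_model_family(n_observations, n_features):
    """Rule of thumb only; the threshold used here is illustrative."""
    if n_observations <= n_features:
        # Few rows relative to columns: prefer simpler, high-bias/low-variance models.
        return ["linear regression", "naive Bayes", "linear SVM"]
    # Many rows relative to columns: flexible, low-bias/high-variance models become viable.
    return ["KNN", "decision tree", "kernel SVM"]


print(suggest_model_family(n_observations=50, n_features=20000))
print(suggest_model_family(n_observations=100000, n_features=20))
```

In practice this would only narrow the candidates; cross-validation on the actual data would still decide between them.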

An accurate model predicts, for a given observation, a response that is close to the actual value. Restrictive techniques, such as linear regression or the LASSO, are highly interpretable because the effect of each predictor can be easily understood. More flexible models can offer greater accuracy, but at the expense of interpretability.
