Based on your Module topics, what did you find new and interesting? And what appeared to be a review?

Business Analytics

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Descriptive Data Mining

Chapter 5

Introduction (Slide 1 of 2)

The increase in the use of data-mining techniques in business has been caused largely by three events:

The explosion in the amount of data being produced and electronically tracked.

The ability to electronically warehouse these data.

The affordability of computer power to analyze the data.

Introduction (Slide 2 of 2)

Observation: Set of recorded values of variables associated with a single entity.

Unsupervised learning: A descriptive data-mining technique used to identify relationships between observations.

Thought of as high-dimensional descriptive analytics.

There is no outcome variable to predict; instead, qualitative assessments are used to assess and compare the results.

Cluster Analysis

Measuring Similarity Between Observations

Hierarchical Clustering

k-Means Clustering

Hierarchical Clustering versus k-Means Clustering

Cluster Analysis (Slide 1 of 21)

Goal of clustering is to segment observations into similar groups based on observed variables.

Can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration.

Commonly used in marketing to divide customers into different homogeneous groups; known as market segmentation.

Used to identify outliers.

Cluster Analysis (Slide 2 of 21)

Clustering methods:

Bottom-up hierarchical clustering starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters.

k-means clustering assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible.

Both methods depend on how similar two observations are; hence, we have to measure similarity between observations.

Cluster Analysis (Slide 3 of 21)

Measuring Similarity Between Observations:

When observations include numeric variables, Euclidean distance is the most common method to measure dissimilarity between observations.

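Let observations $u = (u_1, u_2, \ldots, u_q)$ and $v = (v_1, v_2, \ldots, v_q)$ each comprise measurements of q variables.

The Euclidean distance between observations u and v is:

$$d_{uv} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}$$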

Cluster Analysis (Slide 4 of 21)

Measuring Similarity Between Observations:

Illustration:

KTC is a financial advising company that provides personalized financial advice to its clients.

KTC would like to segment its customers into several groups (or clusters) so that the customers within a group are similar with respect to key characteristics and dissimilar to customers in other groups.

For each customer, KTC has an observation of seven variables: Age, Female, Income, Married, Children, Car Loan, Mortgage.

Example: The observation u = (61, 0, 57881, 1, 2, 0, 0) corresponds to a 61-year-old male with an annual income of $57,881, married with two children, but no car loan and no mortgage.

Cluster Analysis (Slide 5 of 21)

Figure 5.1: Euclidean Distance

Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values.

Figure 5.1 depicts Euclidean distance for two observations consisting of two variable measurements.

Euclidean distance is highly influenced by the scale on which variables are measured.

Therefore, it is common to standardize the units of each variable j of each observation u.

Example: $u_j$, the value of variable j in observation u, is replaced with its z-score, $z_j = (u_j - \bar{x}_j)/s_j$, where $\bar{x}_j$ and $s_j$ are the mean and standard deviation of variable j.

The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations.
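As a rough illustration of why standardizing matters, the following Python sketch computes a Euclidean distance before and after converting each variable to z-scores. Only the first (Age, Income) pair comes from the KTC example; the other two customers and all function names are hypothetical, and the sample standard deviation is assumed for the z-scores.

```python
import math
import statistics


def euclidean(u, v):
    """Euclidean distance between two observations measured on the same variables."""
    return math.sqrt(sum((uj - vj) ** 2 for uj, vj in zip(u, v)))


def standardize(observations):
    """Replace every value with its z-score, computed variable by variable."""
    columns = list(zip(*observations))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]  # assumption: sample std. dev.
    return [
        tuple((x - m) / s for x, m, s in zip(obs, means, stdevs))
        for obs in observations
    ]


# (Age, Income): the first row is from the KTC example; the other rows are made up.
customers = [(61, 57881), (30, 20000), (45, 40000)]

raw_distance = euclidean(customers[0], customers[1])
z_customers = standardize(customers)
z_distance = euclidean(z_customers[0], z_customers[1])

print(f"raw distance:          {raw_distance:.1f}")  # dominated by Income
print(f"standardized distance: {z_distance:.3f}")    # Age and Income both contribute
```

On the raw scale the income difference swamps the age difference; after standardization both variables contribute on comparable terms.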

Cluster Analysis (Slide 6 of 21)

Euclidean distance is highly influenced by the scale on which variables are measured:

Common to standardize the units of each variable j of each observation u.

The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations.

Cluster Analysis (Slide 7 of 21)

When clustering observations solely on the basis of categorical variables encoded as 0–1, a better measure of similarity between two observations can be achieved by counting the number of variables with matching values.

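The simplest overlap measure is called the matching coefficient and is computed as:

$$\text{matching coefficient} = \frac{\text{number of variables with matching value for observations } u \text{ and } v}{\text{total number of variables}}$$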

Cluster Analysis (Slide 8 of 21)

A weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations.

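To avoid misstating similarity due to the absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed as:

$$\text{Jaccard's coefficient} = \frac{\text{number of variables with matching nonzero value for observations } u \text{ and } v}{\text{total number of variables} - \text{number of variables with matching zero values for observations } u \text{ and } v}$$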

Cluster Analysis (Slide 9 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables

Observation  Female  Married  Loan  Mortgage
1            1       0        0     0
2            0       1        1     1
3            1       1        1     0
4            1       1        0     0
5            1       1        0     0

Cluster Analysis (Slide 10 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables (cont.)

Similarity Matrix Based on Matching Coefficient:

Observation  1     2     3     4  5
1            1
2            0     1
3            0.5   0.5   1
4            0.75  0.25  0.75  1
5            0.75  0.25  0.75  1  1

Cluster Analysis (Slide 11 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables (cont.)

Similarity Matrix Based on Jaccard’s Coefficient:

Observation  1      2     3      4  5
1            1
2            0      1
3            0.333  0.5   1
4            0.5    0.25  0.667  1
5            0.5    0.25  0.667  1  1
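The two similarity matrices in Table 5.1 can be reproduced with a short Python sketch. The function names are illustrative, and returning 0 when two observations consist only of matching zeros is an assumption, since the slides do not cover that case.

```python
# Binary observations from Table 5.1: (Female, Married, Loan, Mortgage).
observations = {
    1: (1, 0, 0, 0),
    2: (0, 1, 1, 1),
    3: (1, 1, 1, 0),
    4: (1, 1, 0, 0),
    5: (1, 1, 0, 0),
}


def matching_coefficient(u, v):
    """Share of variables on which u and v have the same value (0-0 or 1-1)."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


def jaccard_coefficient(u, v):
    """Like the matching coefficient, but matching zeros are not counted."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - both_zero
    # Assumption: similarity is 0 when every variable is a matching zero.
    return both_one / denominator if denominator else 0.0


def similarity_matrix(measure):
    """Lower-triangular similarity matrix keyed by (row id, column id)."""
    ids = sorted(observations)
    return {
        (i, j): round(measure(observations[i], observations[j]), 3)
        for i in ids
        for j in ids
        if j <= i
    }


matching = similarity_matrix(matching_coefficient)
jaccard = similarity_matrix(jaccard_coefficient)
print(matching[(4, 1)], jaccard[(4, 1)])  # 0.75 0.5
```

Observations 1 and 4 share two matching zeros (Loan and Mortgage), which is why their matching coefficient (0.75) is higher than their Jaccard's coefficient (0.5).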

Cluster Analysis (Slide 12 of 21)

Hierarchical Clustering:

Determines the similarity of two clusters by considering the similarity between the observations composing either cluster.

Starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster.

Given a way to measure similarity between observations, there are several clustering method alternatives for comparing observations in two clusters to obtain a cluster similarity measure:

Single linkage.

Complete linkage.

Group average linkage.

Median linkage.

Centroid linkage.

Cluster Analysis (Slide 13 of 21)

Single linkage: The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar.

Complete linkage: This clustering method defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.

Group Average linkage: Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters.

Median linkage: Analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters.

Centroid linkage uses the averaging concept of cluster centroids to define between-cluster similarity.

Single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster.

Complete linkage will consider two clusters to be close if their most-different pair of observations are close. This method produces clusters such that all member observations of a cluster are relatively close to each other.
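To make the linkage definitions concrete, here is a minimal Python sketch that compares two small clusters under each method. It works with Euclidean distance as the dissimilarity measure, so the "most similar" pair is the one with the smallest distance. The two clusters and the function names are hypothetical.

```python
import math
from itertools import product
from statistics import mean, median


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(cluster_a, cluster_b):
    """Distances for every pair made of one observation from each cluster."""
    return [euclidean(u, v) for u, v in product(cluster_a, cluster_b)]


def centroid(cluster):
    """Variable-by-variable average of the observations in a cluster."""
    return tuple(mean(values) for values in zip(*cluster))


def linkage_distances(cluster_a, cluster_b):
    d = pairwise_distances(cluster_a, cluster_b)
    return {
        "single": min(d),          # closest pair
        "complete": max(d),        # most-different pair
        "group average": mean(d),  # average over all pairs
        "median": median(d),       # median over all pairs
        "centroid": euclidean(centroid(cluster_a), centroid(cluster_b)),
    }


# Hypothetical clusters of two-variable observations.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (5, 0)]
for name, value in linkage_distances(cluster_a, cluster_b).items():
    print(f"{name:>13}: {value:.3f}")
```

Single linkage always reports the smallest of these values and complete linkage the largest, with the averaging methods in between.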

Cluster Analysis (Slide 14 of 21)

Figure 5.2: Measuring Similarity Between Clusters

Cluster Analysis (Slide 15 of 21)

Ward’s method merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible.

When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as: ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2.

A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation.

Cluster Analysis (Slide 16 of 21)

Figure 5.3: Dendrogram for KTC Using Matching Coefficients and Group Average Linkage

Cluster Analysis (Slide 17 of 21)

k-Means Clustering:

Given a value of k, the k-means algorithm randomly assigns each observation to one of the k clusters.

After all observations have been assigned to a cluster, the resulting cluster centroids are calculated.

Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.

The algorithm repeats this process (calculate cluster centroid, assign observation to cluster with nearest centroid) until there is no change in the clusters or a specified maximum number of iterations is reached.

One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters.
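The iteration described above (random assignment, centroid calculation, reassignment to the nearest centroid, repeat) might be sketched in Python as follows. This is an illustrative sketch rather than the implementation used by any particular software package. The `initial_labels` parameter is a hypothetical addition for reproducible demonstrations, and re-seeding an empty cluster with a random observation is an assumption the slides do not address.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(points):
    """Variable-by-variable average of the observations in one cluster."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(observations, k, max_iter=100, initial_labels=None, seed=0):
    rng = random.Random(seed)
    # Step 1: assign each observation to one of the k clusters at random,
    # unless the caller supplies a starting assignment (hypothetical option).
    labels = list(initial_labels) if initial_labels else [
        rng.randrange(k) for _ in observations
    ]
    centroids = []
    for _ in range(max_iter):
        # Step 2: calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [x for x, lab in zip(observations, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random observation.
            centroids.append(centroid(members) if members else rng.choice(observations))
        # Step 3: reassign each observation to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(x, centroids[c]))
            for x in observations
        ]
        if new_labels == labels:  # no change in the clusters: stop
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two well-separated groups of three observations.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2, initial_labels=[0, 1, 0, 1, 0, 1])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the result depends on the starting assignment, it is common in practice to run the algorithm several times from different random starts and keep the best solution.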

Cluster Analysis (Slide 18 of 21)

Figure 5.4: Clustering Observations by Age and Income Using k-Means Clustering with k = 3

To illustrate k-means clustering, we consider a 3-means clustering of a small sample of KTC’s customer data in the file DemoKTC.

Figure 5.4 shows three clusters based on customer income and age.

Cluster 1 is characterized by relatively younger, lower-income customers (Cluster 1’s centroid is at [33, $20,364]).

Cluster 2 is characterized by relatively older, higher-income customers (Cluster 2’s centroid is at [58, $47,729]).

Cluster 3 is characterized by relatively older, lower-income customers (Cluster 3’s centroid is at [53, $21,416]).

Cluster Analysis (Slide 19 of 21)

Table 5.2: Average Distances Within Clusters

           No. of Observations  Average Distance Between Observations in Cluster
Cluster 1  12                   0.622
Cluster 2  8                    0.739
Cluster 3  10                   0.520

Table 5.2 shows that Cluster 2 is the smallest, most heterogeneous cluster, whereas Cluster 1 is the largest cluster, and Cluster 3 is the most homogeneous cluster.

In Table 5.3, we compare the distances between cluster centroids to the average distances within clusters in Table 5.2.

Cluster 1 and Cluster 2 are the most distinct from each other (centroid distance 2.784).

Cluster 1 and Cluster 3 are the least distinct from each other (centroid distance 1.529).

Comparing the distance between the Cluster 2 and Cluster 3 centroids (1.964) to the average distance between observations within Cluster 2 (0.739) suggests that there are observations within Cluster 2 that are more similar to those in Cluster 3 than to those in Cluster 2.
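The rule of thumb about the ratio of between-cluster distance to within-cluster distance can be checked against the values in Tables 5.2 and 5.3 with a few lines of Python. The variable names are illustrative, and using the centroid distance as the between-cluster distance is an assumption based on how the two tables are compared.

```python
# Average within-cluster distances (Table 5.2) and centroid distances (Table 5.3).
within = {1: 0.622, 2: 0.739, 3: 0.520}
between = {(1, 2): 2.784, (1, 3): 1.529, (2, 3): 1.964}

# Ratio of between-cluster distance to within-cluster distance,
# computed for each pair of clusters and each member of the pair.
ratios = {}
for (a, b), centroid_distance in between.items():
    for cluster in (a, b):
        ratios[(a, b, cluster)] = centroid_distance / within[cluster]

for (a, b, cluster), ratio in ratios.items():
    print(f"clusters {a}-{b} vs. within cluster {cluster}: {ratio:.2f}")

# Rule of thumb: ratios above 1.0 indicate useful clusters.
all_exceed_one = all(r > 1.0 for r in ratios.values())
print("all ratios exceed 1.0:", all_exceed_one)
```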

Cluster Analysis (Slide 20 of 21)

Table 5.3: Distances Between Cluster Centroids

           Cluster 1  Cluster 2  Cluster 3
Cluster 1  0          2.784      1.529
Cluster 2  2.784      0          1.964
Cluster 3  1.529      1.964      0

Cluster Analysis (Slide 21 of 21)

Hierarchical Clustering versus k-Means Clustering

Hierarchical Clustering:

Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters.

Convenient method if you want to observe how clusters are nested.

k-Means Clustering:

Suitable when you know how many clusters you want and you have a larger data set (e.g., more than 500 observations).

Partitions the observations, which is appropriate if trying to summarize the data with k “average” observations that describe the data with the minimum amount of error.

Because Euclidean distance is the standard metric for k-means clustering, k-means is generally not as appropriate for binary or ordinal data, for which an “average” is not meaningful.

Association Rules

Evaluating Association Rules

Association Rules (Slide 1 of 7)

Association rules: If-then statements which convey the likelihood of certain items being purchased together.

Although association rules are an important tool in market basket analysis, they are also applicable to other disciplines.

Antecedent: The collection of items (or item set) corresponding to the if portion of the rule.

Consequent: The item set corresponding to the then portion of the rule.

Support count of an item set: Number of transactions in the data that include that item set.

Association Rules (Slide 2 of 7)

Table 5.4: Shopping-Cart Transactions

Transaction Shopping Cart

1 bread, peanut butter, milk, fruit, jelly

2 bread, jelly, soda, potato chips, milk, fruit, vegetables, peanut butter

3 whipped cream, fruit, chocolate sauce, beer

4 steak, jelly, soda, potato chips, bread, fruit

5 jelly, soda, peanut butter, milk, fruit

6 jelly, soda, potato chips, milk, bread, fruit

7 fruit, soda, potato chips, milk

8 fruit, soda, peanut butter, milk

9 fruit, cheese, yogurt

10 yogurt, vegetables, beer

Hy-Vee grocery store would like to gain insight into its customers’ purchase patterns to possibly improve its in-aisle product placement and cross-product promotions.

Table 5.4 contains a small sample of data where each transaction comprises the items purchased by a shopper in a single visit to a Hy-Vee.

An example of an association rule from this data would be “if {bread, jelly}, then {peanut butter}” meaning that “if a transaction includes bread and jelly, then it also includes peanut butter.”

Antecedent – {bread, jelly},

Consequent – {peanut butter}

The potential impact of an association rule is often governed by the number of transactions it may affect, which is measured by computing the support count of the item set consisting of the union of its antecedent and consequent.

Investigating the rule “if {bread, jelly}, then {peanut butter}” from Table 5.4, we see the support count of {bread, jelly, peanut butter} is 2.

Association Rules (Slide 3 of 7)

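Confidence: Helps identify reliable association rules:

$$\text{confidence} = \frac{\text{support of \{antecedent and consequent\}}}{\text{support of antecedent}}$$

Lift ratio: Measure to evaluate the efficiency of a rule:

$$\text{lift ratio} = \frac{\text{confidence}}{\text{support of consequent} / \text{total number of transactions}}$$

For the data in Table 5.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5 and lift ratio = 0.5/(4/10) = 1.25.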

This measure of confidence can be viewed as the conditional probability that the consequent item set occurs given that the antecedent item set occurs.

A high value of confidence suggests a rule in which the consequent is frequently true when the antecedent is true, but a high value of confidence can be misleading.

For example, if the support of the consequent is high—that is, the item set corresponding to the then part is very frequent—then the confidence of the association rule could be high even if there is little or no association between the items.

A lift ratio greater than 1 suggests that there is some usefulness to the rule and that it is better at identifying cases when the consequent occurs than no rule at all.

For the data in Table 5.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25.

In other words, identifying a customer who purchased both bread and jelly as one who also purchased peanut butter is 25% better than just guessing that a random customer purchased peanut butter.
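The support count, confidence, and lift ratio for this rule can be recomputed from the shopping-cart transactions in Table 5.4. The Python sketch below is illustrative only; the function names are assumptions, and each shopping cart is represented as a set of item names.

```python
# Shopping-cart transactions from Table 5.4.
transactions = [
    {"bread", "peanut butter", "milk", "fruit", "jelly"},
    {"bread", "jelly", "soda", "potato chips", "milk", "fruit", "vegetables", "peanut butter"},
    {"whipped cream", "fruit", "chocolate sauce", "beer"},
    {"steak", "jelly", "soda", "potato chips", "bread", "fruit"},
    {"jelly", "soda", "peanut butter", "milk", "fruit"},
    {"jelly", "soda", "potato chips", "milk", "bread", "fruit"},
    {"fruit", "soda", "potato chips", "milk"},
    {"fruit", "soda", "peanut butter", "milk"},
    {"fruit", "cheese", "yogurt"},
    {"yogurt", "vegetables", "beer"},
]


def support_count(item_set):
    """Number of transactions that contain every item in item_set."""
    return sum(1 for cart in transactions if item_set <= cart)


def confidence(antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    return support_count(antecedent | consequent) / support_count(antecedent)


def lift_ratio(antecedent, consequent):
    """Confidence divided by the share of all transactions containing the consequent."""
    return confidence(antecedent, consequent) / (
        support_count(consequent) / len(transactions)
    )


antecedent = {"bread", "jelly"}
consequent = {"peanut butter"}
print("support count:", support_count(antecedent | consequent))  # 2
print("confidence:", confidence(antecedent, consequent))         # 0.5
print("lift ratio:", round(lift_ratio(antecedent, consequent), 2))  # 1.25
```

The same functions can be applied to any rule in Table 5.5, for example `lift_ratio({"bread"}, {"jelly"})`.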

Association Rules (Slide 4 of 7)

Table 5.5: Association Rules for Hy-Vee

Antecedent (A)        Consequent (C)  Support for A  Support for C  Support for A & C  Confidence (%)  Lift Ratio
Bread                 Fruit, Jelly    4              5              4                  100.0           2.00
Bread                 Jelly           4              5              4                  100.0           2.00
Bread, Fruit          Jelly           4              5              4                  100.0           2.00
Fruit, Jelly          Bread           5              4              4                  80.0            2.00
Jelly                 Bread           5              4              4                  80.0            2.00
Jelly                 Bread, Fruit    5              4              4                  80.0            2.00
Fruit, Potato Chips   Soda            4              6              4                  100.0           1.67
Peanut Butter         Milk            4              6              4                  100.0           1.67
Peanut Butter         Milk, Fruit     4              6              4                  100.0           1.67

Association Rules (Slide 5 of 7)

Table 5.5: Association Rules for Hy-Vee (cont.)

Antecedent (A)        Consequent (C)        Support for A  Support for C  Support for A & C  Confidence (%)  Lift Ratio
Peanut Butter, Fruit  Milk                  4              6              4                  100.0           1.67
Potato Chips          Fruit, Soda           4              6              4                  100.0           1.67
Potato Chips          Soda                  4              6              4                  100.0           1.67
Fruit, Soda           Potato Chips          6              4              4                  66.7            1.67
Milk                  Peanut Butter         6              4              4                  66.7            1.67
Milk                  Peanut Butter, Fruit  6              4              4                  66.7            1.67
Milk, Fruit           Peanut Butter         6              4              4                  66.7            1.67
Soda                  Fruit, Potato Chips   6              4              4                  66.7            1.67

Association Rules (Slide 6 of 7)

Table 5.5: Association Rules for Hy-Vee (cont.)

Antecedent (A)  Consequent (C)  Support for A  Support for C  Support for A & C  Confidence (%)  Lift Ratio
Soda            Potato Chips    6              4              4                  66.7            1.67
Fruit, Soda     Milk            6              6              5                  83.3            1.39
Milk            Fruit, Soda     6              6              5                  83.3            1.39
Milk            Soda            6              6              5                  83.3            1.39
Milk, Fruit     Soda            6              6              5                  83.3            1.39
Soda            Milk            6              6              5                  83.3            1.39
Soda            Milk, Fruit     6              6              5                  83.3            1.39

Association Rules (Slide 7 of 7)

Evaluating Association Rules:

An association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets.

For example, Walmart mined its transactional data to uncover strong evidence of the association rule, “If a customer purchases a Barbie doll, then a customer also purchases a candy bar.”

An association rule is useful if it is well supported and explains an important previously unknown relationship.

The support of an association rule can generally be improved by basing it on less specific antecedent and consequent item sets.

Text Mining

Voice of the Customer at Triad Airline

Preprocessing Text Data for Analysis

Movie Reviews

Text Mining (1 of 12)

Text, like numerical data, may contain information that can help solve problems and lead to better decisions.

Text mining is the process of extracting useful information from text data.

Text data is often referred to as unstructured data because in its raw form, it cannot be stored in a traditional structured database (rows and columns).

Audio and video data are also examples of unstructured data.

Data mining with text data is more challenging than data mining with traditional numerical data, because it requires more preprocessing to convert the text to a format amenable for analysis.

Text Mining (2 of 12)

Voice of the Customer at Triad Airline:

Triad solicits feedback from its customers through a follow-up e-mail the day after the customer has completed a flight.

Survey asks the customer to rate various aspects of the flight and asks the respondent to type comments into a dialog box in the e-mail; includes:

Quantitative feedback from the ratings.

Comments entered by the respondents which need to be analyzed.

A collection of text documents to be analyzed is called a corpus.

Text Mining (3 of 12)

Table 5.6: Ten Respondents’ Concerns for Triad Airlines

Concerns

The wi-fi service was horrible. It was slow and cut off several times.

My seat was uncomfortable.

My flight was delayed 2 hours for no apparent reason.

My seat would not recline.

The man at the ticket counter was rude. Service was horrible.

The flight attendant was rude. Service was bad.

My flight was delayed with no explanation.

My drink spilled when the guy in front of me reclined his seat.

My flight was canceled.

The arm rest of my seat was nasty.

Text Mining (4 of 12)

Voice of the Customer at Triad Airline:

To be analyzed, text data needs to be converted to structured data (rows and columns of numerical data) so that the tools of descriptive statistics, data visualization and data mining can be applied.

Think of converting a group of documents into a matrix of rows and columns where each row corresponds to a document and each column corresponds to a particular word.

A presence/absence or binary term-document matrix is a matrix with the rows representing documents and the columns representing words.

Entries in the columns indicate either the presence or the absence of a particular word in a particular document.

Text Mining (5 of 12)

Voice of the Customer at Triad Airline (cont.):

Creating the list of terms to use in the presence/absence matrix can be a complicated matter:

Too many terms result in a matrix with many columns, which may be difficult to manage and could yield meaningless results.

Too few terms may miss important relationships.

Term frequency, along with the problem context, is often used as a guide.

In Triad’s case, management used word frequency and the context of having a goal of satisfied customers to come up with the following list of terms they feel are relevant for categorizing the respondents’ comments: delayed, flight, horrible, recline, rude, seat, and service.

Text Mining (6 of 12)

Table 5.7: The Presence/Absence Term-Document Matrix for Triad Airlines

Rows are documents; columns are terms.

Document  Delayed  Flight  Horrible  Recline  Rude  Seat  Service
1         0        0       1         0        0     0     1
2         0        0       0         0        0     1     0
3         1        1       0         0        0     0     0
4         0        0       0         1        0     1     0
5         0        0       1         0        1     0     1
6         0        1       0         0        1     0     1
7         1        1       0         0        0     0     0
8         0        0       0         1        0     1     0
9         0        1       0         0        0     0     0
10        0        0       0         0        0     1     0

Text Mining (7 of 12)

Preprocessing Text Data for Analysis:

The text-mining process converts unstructured text into numerical data and applies quantitative techniques.

Which terms become the headers of the columns of the term-document matrix can greatly impact the analysis.

Tokenization is the process of dividing text into separate terms, referred to as tokens:

Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase.

Different forms of the same word, such as “stacking,” “stacked,” and “stack” probably should not be considered as distinct terms.

Stemming is the process of converting a word to its stem or root word.
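The preprocessing steps above can be tried on the ten comments in Table 5.6 to rebuild the presence/absence matrix in Table 5.7. The Python sketch below is a simplified illustration: the function names are hypothetical, and prefix matching is used as a crude stand-in for real stemming (so that "reclined" counts as "recline"), which would not be adequate for general text.

```python
import re

# Respondents' comments from Table 5.6.
comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    "My seat would not recline.",
    "The man at the ticket counter was rude. Service was horrible.",
    "The flight attendant was rude. Service was bad.",
    "My flight was delayed with no explanation.",
    "My drink spilled when the guy in front of me reclined his seat.",
    "My flight was canceled.",
    "The arm rest of my seat was nasty.",
]

# Terms chosen by Triad's management.
terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]


def tokenize(text):
    """Lowercase the text and keep only runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def contains_term(tokens, term):
    """Assumption: a token matches a term if it starts with that term."""
    return any(token.startswith(term) for token in tokens)


# One row per document, one 0/1 column per term.
matrix = [
    [int(contains_term(tokenize(comment), term)) for term in terms]
    for comment in comments
]

for doc_id, row in enumerate(matrix, start=1):
    print(doc_id, row)
```

A dedicated stemming algorithm would normally replace the prefix check, but the simple version is enough to reproduce the matrix for this small corpus.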

Text Mining (8 of 12)

Preprocessing Text Data for Analysis (cont.):

The goal of preprocessing is to generate a list of most-relevant terms that is sufficiently small so as to lend itself to analysis:

Frequency can be used to eliminate words from consideration as tokens.

Low-frequency words probably will not be very useful as tokens.

Consolidating words that are synonyms can reduce the set of tokens.

Most text-mining software gives the user the ability to manually specify terms to include or exclude as tokens.

The use of slang, humor, and sarcasm can cause interpretation problems and might require more sophisticated data cleansing and subjective intervention on the part of the analyst to avoid misinterpretation.

Data preprocessing parses the original text data down to the set of tokens deemed relevant for the topic being studied.

© 2021 Cengage Learning. All Rights Reserved. May …

Business Analytics

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Descriptive Data Mining

Chapter 5

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Introduction (Slide 1 of 2)

The increase in the use of data-mining techniques in business has been caused largely by three events:

The explosion in the amount of data being produced and electronically tracked.

The ability to electronically warehouse these data.

The affordability of computer power to analyze the data.

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Introduction (Slide 2 of 2)

Observation: Set of recorded values of variables associated with a single entity.

Unsupervised learning: A descriptive data-mining technique used to identify relationships between observations.

Thought of as high-dimensional descriptive analytics.

There is no outcome variable to predict; instead, qualitative assessments are used to assess and compare the results.

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

4

Cluster Analysis

Measuring Similarity Between Observations

Hierarchical Clustering

k-Means Clustering

Hierarchical Clustering versus k-Means Clustering

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Cluster Analysis (Slide 1 of 21)

Goal of clustering is to segment observations into similar groups based on observed variables.

Can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration.

Commonly used in marketing to divide customers into different homogenous groups; known as market segmentation.

Used to identify outliers.

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Cluster Analysis (Slide 2 of 21)

Clustering methods:

Bottom-up hierarchical clustering starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters.

k-means clustering assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible.

Both methods depend on how two observations are similar—hence, we have to measure similarity between observations.

Cluster Analysis (Slide 3 of 21)

Measuring Similarity Between Observations:

When observations include numeric variables, Euclidean distance is the most common method to measure dissimilarity between observations.

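Let observations u = (u_1, u_2, ..., u_q) and v = (v_1, v_2, ..., v_q) each comprise measurements of q variables.

The Euclidean distance between observations u and v is:

$$d_{u,v} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}$$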

Cluster Analysis (Slide 4 of 21)

Measuring Similarity Between Observations:

Illustration:

KTC is a financial advising company that provides personalized financial advice to its clients.

KTC would like to segment its customers into several groups (or clusters) so that the customers within a group are similar to one another and dissimilar to customers in other groups with respect to key characteristics.

For each customer, KTC has an observation of seven variables: Age, Female, Income, Married, Children, Car Loan, Mortgage.

Example: The observation u = (61, 0, 57881, 1, 2, 0, 0) corresponds to a 61-year-old male with an annual income of $57,881, married with two children, but no car loan and no mortgage.

Cluster Analysis (Slide 5 of 21)

Figure 5.1: Euclidean Distance

Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values.

Figure 5.1 depicts Euclidean distance for two observations consisting of two variable measurements.

Euclidean distance is highly influenced by the scale on which variables are measured.

Therefore, it is common to standardize the units of each variable j of each observation u.

Example: u_j, the value of variable j in observation u, is replaced with its z-score, z_j.

The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations.
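The effect of scale on Euclidean distance can be sketched in Python. This is a minimal illustration, not the textbook's procedure: the function names are hypothetical, only the first (Age, Income) pair comes from observation u in the KTC example, and the other two customers are made-up values.

```python
import math
import statistics

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric observations."""
    return math.sqrt(sum((uj - vj) ** 2 for uj, vj in zip(u, v)))

def standardize(observations):
    """Replace each value with its z-score, computed variable by variable."""
    columns = list(zip(*observations))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]  # sample standard deviation
    return [
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(obs, means, stdevs))
        for obs in observations
    ]

# (Age, Income) pairs. The first is taken from observation u in the KTC example;
# the other two are hypothetical values chosen only for illustration.
customers = [(61, 57881), (30, 58000), (60, 20000)]

raw = euclidean_distance(customers[0], customers[1])
z = standardize(customers)
scaled = euclidean_distance(z[0], z[1])
print(round(raw, 1), round(scaled, 3))
```

In raw units, Income dominates the distance, so a 31-year age gap barely registers next to an income gap of tens of thousands of dollars. After standardization both variables contribute on a comparable scale.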

Cluster Analysis (Slide 6 of 21)

Euclidean distance is highly influenced by the scale on which variables are measured:

Common to standardize the units of each variable j of each observation u.

Cluster Analysis (Slide 7 of 21)

When clustering observations solely on the basis of categorical variables encoded as 0–1, a better measure of similarity between two observations can be achieved by counting the number of variables with matching values.

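The simplest overlap measure is called the matching coefficient and is computed as:

$$\text{matching coefficient} = \frac{\text{number of variables with matching value for observations } u \text{ and } v}{\text{total number of variables}}$$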

Cluster Analysis (Slide 8 of 21)

A weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations.

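To avoid misstating similarity due to the absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed as:

$$\text{Jaccard's coefficient} = \frac{\text{number of variables with matching nonzero value for observations } u \text{ and } v}{\text{total number of variables} - \text{number of variables with matching zero values for observations } u \text{ and } v}$$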

Cluster Analysis (Slide 9 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables

Observation | Female | Married | Loan | Mortgage
1 | 1 | 0 | 0 | 0
2 | 0 | 1 | 1 | 1
3 | 1 | 1 | 1 | 0
4 | 1 | 1 | 0 | 0
5 | 1 | 1 | 0 | 0

Cluster Analysis (Slide 10 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables (cont.)

Similarity Matrix Based on Matching Coefficient:

Observation | 1 | 2 | 3 | 4 | 5
1 | 1
2 | 0 | 1
3 | 0.5 | 0.5 | 1
4 | 0.75 | 0.25 | 0.75 | 1
5 | 0.75 | 0.25 | 0.75 | 1 | 1

Cluster Analysis (Slide 11 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables (cont.)

Similarity Matrix Based on Jaccard’s Coefficient:

Observation | 1 | 2 | 3 | 4 | 5
1 | 1
2 | 0 | 1
3 | 0.333 | 0.5 | 1
4 | 0.5 | 0.25 | 0.667 | 1
5 | 0.5 | 0.25 | 0.667 | 1 | 1
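The two similarity matrices in Table 5.1 can be checked with a short Python sketch. The function names are illustrative, and returning 0 when two observations are all zeros is a convention assumed here (the slides do not cover that case).

```python
# Observations from Table 5.1: (Female, Married, Loan, Mortgage)
observations = {
    1: (1, 0, 0, 0),
    2: (0, 1, 1, 1),
    3: (1, 1, 1, 0),
    4: (1, 1, 0, 0),
    5: (1, 1, 0, 0),
}

def matching_coefficient(u, v):
    """Share of variables on which u and v have the same value."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

def jaccard_coefficient(u, v):
    """Like the matching coefficient, but matching zeros are not counted."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - both_zero
    # Assumed convention: two all-zero observations get similarity 0
    return both_one / denominator if denominator else 0.0

# Print the lower triangle of each similarity matrix
for name, measure in (("matching", matching_coefficient), ("Jaccard", jaccard_coefficient)):
    print(name)
    for i in observations:
        row = [round(measure(observations[i], observations[j]), 3)
               for j in observations if j <= i]
        print(i, row)
```

Observations 1 and 4 illustrate the difference: they agree on Female, Loan, and Mortgage, but two of those agreements are shared zeros, so the Jaccard value is lower than the matching coefficient.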

Cluster Analysis (Slide 12 of 21)

Hierarchical Clustering:

Determines the similarity of two clusters by considering the similarity between the observations composing either cluster.

Starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster.

Given a way to measure similarity between observations, there are several clustering method alternatives for comparing observations in two clusters to obtain a cluster similarity measure:

Single linkage.

Complete linkage.

Group average linkage.

Median linkage.

Centroid linkage.

Cluster Analysis (Slide 13 of 21)

Single linkage: The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar.

Complete linkage: This clustering method defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.

Group Average linkage: Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters.

Median linkage: Analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters.

Centroid linkage uses the averaging concept of cluster centroids to define between-cluster similarity.

Single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster.

Complete linkage will consider two clusters to be close if their most-different pair of observations are close. This method produces clusters such that all member observations of a cluster are relatively close to each other.
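The linkage definitions above can be compared side by side in a small Python sketch. It expresses each linkage as a dissimilarity (Euclidean distance), so "most similar pair" becomes "smallest distance". The two clusters are hypothetical points, and the function names are illustrative.

```python
import math
from statistics import mean, median

def pairwise(cluster_a, cluster_b):
    """Distances over all pairs with one observation from each cluster."""
    return [math.dist(u, v) for u in cluster_a for v in cluster_b]

def single_linkage(a, b):
    return min(pairwise(a, b))      # closest pair

def complete_linkage(a, b):
    return max(pairwise(a, b))      # most different pair

def group_average_linkage(a, b):
    return mean(pairwise(a, b))     # average over all pairs

def median_linkage(a, b):
    return median(pairwise(a, b))   # median over all pairs

def centroid(cluster):
    return tuple(mean(col) for col in zip(*cluster))

def centroid_linkage(a, b):
    return math.dist(centroid(a), centroid(b))

# Hypothetical clusters of two observations each
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (5, 0)]

for linkage in (single_linkage, complete_linkage, group_average_linkage,
                median_linkage, centroid_linkage):
    print(linkage.__name__, round(linkage(cluster_a, cluster_b), 3))
```

Single linkage always reports the smallest of these distances and complete linkage the largest, which is why complete linkage tends to produce tighter clusters.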

Cluster Analysis (Slide 14 of 21)

Figure 5.2: Measuring Similarity Between Clusters

Cluster Analysis (Slide 15 of 21)

Ward’s method merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible.

When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as: ((dissimilarity between A and C) + (dissimilarity between B and C)) divided by 2.

A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation.
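The sequence of merges that a dendrogram displays can be sketched in Python. The example below is an illustration under stated assumptions: it uses the five observations of Table 5.1 (not the full KTC data behind Figure 5.3), measures dissimilarity as 1 minus the matching coefficient, and applies group average linkage. Ties are broken by whichever pair is found first.

```python
from itertools import combinations

# Binary observations from Table 5.1: (Female, Married, Loan, Mortgage)
data = {1: (1, 0, 0, 0), 2: (0, 1, 1, 1), 3: (1, 1, 1, 0),
        4: (1, 1, 0, 0), 5: (1, 1, 0, 0)}

def dissimilarity(u, v):
    """1 - matching coefficient: share of variables on which u and v differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def group_average(cluster_a, cluster_b):
    """Average dissimilarity over all pairs with one member from each cluster."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(dissimilarity(data[i], data[j]) for i, j in pairs) / len(pairs)

def agglomerate(labels):
    """Bottom-up clustering; returns the merge history as (a, b, height) tuples."""
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are currently the most similar
        a, b = min(combinations(clusters, 2), key=lambda pair: group_average(*pair))
        height = group_average(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), height))
    return merges

merges = agglomerate(sorted(data))
for a, b, height in merges:
    print(f"merge {a} + {b} at dissimilarity {height:.3f}")
```

Each printed line corresponds to one horizontal join in a dendrogram, with the dissimilarity giving the height of the join. Observations 4 and 5 are identical, so they merge first at height 0.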

Cluster Analysis (Slide 16 of 21)

Figure 5.3: Dendrogram for KTC Using Matching Coefficients and Group Average Linkage

Cluster Analysis (Slide 17 of 21)

k-Means Clustering:

Given a value of k, the k-means algorithm randomly assigns each observation to one of the k clusters.

After all observations have been assigned to a cluster, the resulting cluster centroids are calculated.

Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.

The algorithm repeats this process (calculate cluster centroid, assign observation to cluster with nearest centroid) until there is no change in the clusters or a specified maximum number of iterations is reached.

One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters.
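The assign-and-recompute loop described above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the data points are hypothetical, the `labels` argument (a fixed starting assignment, used here to make the run reproducible) is an addition for the example, and re-seeding an empty cluster with a random observation is an assumed convention.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(points, k, labels=None, max_iter=100, seed=0):
    """Return (labels, centroids) once assignments stop changing."""
    rng = random.Random(seed)
    if labels is None:
        # Random initial assignment of each observation to one of k clusters
        labels = [rng.randrange(k) for _ in points]
    labels = list(labels)
    centroids = []
    for _ in range(max_iter):
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random observation
            centroids.append(centroid(members) if members else rng.choice(points))
        # Reassign every observation to the cluster with the closest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # no observation changed cluster
        labels = new_labels
    return labels, centroids

# Hypothetical two-variable observations forming two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2, labels=[0, 1, 0, 1, 0, 1])
print(labels, centroids)
```

Starting from a deliberately mixed assignment, the first pass already separates the two groups and the second pass confirms that nothing changes, which is the stopping condition.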

Cluster Analysis (Slide 18 of 21)

Figure 5.4: Clustering Observations by Age and Income Using k-Means Clustering with k = 3

To illustrate k-means clustering, we consider a 3-means clustering of a small sample of KTC’s customer data in the file DemoKTC.

Figure 5.4 shows three clusters based on customer income and age.

Cluster 1 is characterized by relatively younger, lower-income customers (Cluster 1’s centroid is at [33, $20,364]).

Cluster 2 is characterized by relatively older, higher-income customers (Cluster 2’s centroid is at [58, $47,729]).

Cluster 3 is characterized by relatively older, lower-income customers (Cluster 3’s centroid is at [53, $21,416]).

Cluster Analysis (Slide 19 of 21)

Table 5.2: Average Distances Within Clusters

Cluster | No. of Observations | Average Distance Between Observations in Cluster
Cluster 1 | 12 | 0.622
Cluster 2 | 8 | 0.739
Cluster 3 | 10 | 0.520

Table 5.2 shows that Cluster 2 is the smallest, most heterogeneous cluster, whereas Cluster 1 is the largest cluster, and Cluster 3 is the most homogeneous cluster.

In Table 5.3, we compare the average distances between clusters to the average distance within clusters in Table 5.2.

Cluster 1 and Cluster 2 are the most distinct from each other.

Cluster 2 and Cluster 3 are the least distinct from each other.

Comparing the distance between the Cluster 2 and Cluster 3 centroids (1.964) to the average distance between observations within Cluster 2 (0.739) suggests that there are observations within Cluster 2 that are more similar to those in Cluster 3 than to those in Cluster 2.

Cluster Analysis (Slide 20 of 21)

Table 5.3: Distances Between Cluster Centroids

Centroid of | Cluster 1 | Cluster 2 | Cluster 3
Cluster 1 | 0 | 2.784 | 1.529
Cluster 2 | 2.784 | 0 | 1.964
Cluster 3 | 1.529 | 1.964 | 0
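The rule of thumb comparing between-cluster distance to within-cluster distance can be applied to Tables 5.2 and 5.3 with a few lines of Python. Dividing each centroid-to-centroid distance by the within-cluster average of each cluster in the pair is one reasonable reading of that ratio, assumed here for illustration.

```python
# Average distance between observations within each cluster (Table 5.2)
within = {1: 0.622, 2: 0.739, 3: 0.520}

# Distances between cluster centroids (Table 5.3)
between = {(1, 2): 2.784, (1, 3): 1.529, (2, 3): 1.964}

# ratios[(a, b)]: distance between the centroids of a and b,
# relative to the average within-cluster distance of cluster a
ratios = {}
for (a, b), distance in between.items():
    ratios[(a, b)] = distance / within[a]
    ratios[(b, a)] = distance / within[b]

for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 2))
```

Under this reading, every ratio exceeds 1.0, so all three clusters pass the rule of thumb.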

Cluster Analysis (Slide 21 of 21)

Hierarchical Clustering versus k-Means Clustering

Hierarchical Clustering | k-Means Clustering
Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters. | Suitable when you know how many clusters you want and you have a larger data set (e.g., more than 500 observations).
Convenient method if you want to observe how clusters are nested. | Partitions the observations, which is appropriate if trying to summarize the data with k “average” observations that describe the data with the minimum amount of error.

Because Euclidean distance is the standard metric for k-means clustering, k-means is generally not as appropriate for binary or ordinal data for which an “average” is not meaningful.

Association Rules

Evaluating Association Rules

Association Rules (Slide 1 of 7)

Association rules: If-then statements that convey the likelihood of certain items being purchased together.

Although association rules are an important tool in market basket analysis, they are also applicable to other disciplines.

Antecedent: The collection of items (or item set) corresponding to the if portion of the rule.

Consequent: The item set corresponding to the then portion of the rule.

Support count of an item set: Number of transactions in the data that include that item set.

Association Rules (Slide 2 of 7)

Table 5.4: Shopping-Cart Transactions

Transaction | Shopping Cart
1 | bread, peanut butter, milk, fruit, jelly
2 | bread, jelly, soda, potato chips, milk, fruit, vegetables, peanut butter
3 | whipped cream, fruit, chocolate sauce, beer
4 | steak, jelly, soda, potato chips, bread, fruit
5 | jelly, soda, peanut butter, milk, fruit
6 | jelly, soda, potato chips, milk, bread, fruit
7 | fruit, soda, potato chips, milk
8 | fruit, soda, peanut butter, milk
9 | fruit, cheese, yogurt
10 | yogurt, vegetables, beer

Hy-Vee grocery store would like to gain insight into its customers’ purchase patterns to possibly improve its in-aisle product placement and cross-product promotions.

Table 5.4 contains a small sample of data where each transaction comprises the items purchased by a shopper in a single visit to a Hy-Vee.

An example of an association rule from this data would be “if {bread, jelly}, then {peanut butter}” meaning that “if a transaction includes bread and jelly, then it also includes peanut butter.”

Antecedent: {bread, jelly}

Consequent: {peanut butter}

The potential impact of an association rule is often governed by the number of transactions it may affect, which is measured by computing the support count of the item set consisting of the union of its antecedent and consequent.

Investigating the rule “if {bread, jelly}, then {peanut butter}” from Table 5.4, we see the support count of {bread, jelly, peanut butter} is 2.

Association Rules (Slide 3 of 7)

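Confidence: Helps identify reliable association rules:

$$\text{Confidence} = \frac{\text{support of \{antecedent and consequent\}}}{\text{support of antecedent}}$$

Lift ratio: Measure to evaluate the efficiency of a rule:

$$\text{Lift ratio} = \frac{\text{confidence}}{\text{support of consequent} / \text{total number of transactions}}$$

For the data in Table 5.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5 and lift ratio = 0.5/(4/10) = 1.25.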

This measure of confidence can be viewed as the conditional probability that the consequent item set occurs given that the antecedent item set occurs.

A high value of confidence suggests a rule in which the consequent is frequently true when the antecedent is true, but a high value of confidence can be misleading.

For example, if the support of the consequent is high—that is, the item set corresponding to the then part is very frequent—then the confidence of the association rule could be high even if there is little or no association between the items.

A lift ratio greater than 1 suggests that there is some usefulness to the rule and that it is better at identifying cases when the consequent occurs than no rule at all.

For the data in Table 5.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25.

In other words, identifying a customer who purchased both bread and jelly as one who also purchased peanut butter is 25% better than just guessing that a random customer purchased peanut butter.
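The support, confidence, and lift calculations for the Hy-Vee data can be reproduced with a short Python sketch. The transactions are those of Table 5.4; the function names are illustrative.

```python
# Shopping-cart transactions from Table 5.4
transactions = [
    {"bread", "peanut butter", "milk", "fruit", "jelly"},
    {"bread", "jelly", "soda", "potato chips", "milk", "fruit",
     "vegetables", "peanut butter"},
    {"whipped cream", "fruit", "chocolate sauce", "beer"},
    {"steak", "jelly", "soda", "potato chips", "bread", "fruit"},
    {"jelly", "soda", "peanut butter", "milk", "fruit"},
    {"jelly", "soda", "potato chips", "milk", "bread", "fruit"},
    {"fruit", "soda", "potato chips", "milk"},
    {"fruit", "soda", "peanut butter", "milk"},
    {"fruit", "cheese", "yogurt"},
    {"yogurt", "vegetables", "beer"},
]

def support_count(item_set):
    """Number of transactions that contain every item in item_set."""
    return sum(1 for t in transactions if item_set <= t)

def confidence(antecedent, consequent):
    """Support of antecedent and consequent together, over support of antecedent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

def lift_ratio(antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    baseline = support_count(consequent) / len(transactions)
    return confidence(antecedent, consequent) / baseline

rule = ({"bread", "jelly"}, {"peanut butter"})
print(support_count(rule[0] | rule[1]), confidence(*rule), round(lift_ratio(*rule), 2))
```

The same functions can be applied to any antecedent and consequent pair, for example to check individual rows of a rule table.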

Association Rules (Slide 4 of 7)

Table 5.5: Association Rules for Hy-Vee

Antecedent (A) | Consequent (C) | Support for A | Support for C | Support for A & C | Confidence (%) | Lift Ratio
Bread | Fruit, Jelly | 4 | 5 | 4 | 100.0 | 2.00
Bread | Jelly | 4 | 5 | 4 | 100.0 | 2.00
Bread, Fruit | Jelly | 4 | 5 | 4 | 100.0 | 2.00
Fruit, Jelly | Bread | 5 | 4 | 4 | 80.0 | 2.00
Jelly | Bread | 5 | 4 | 4 | 80.0 | 2.00
Jelly | Bread, Fruit | 5 | 4 | 4 | 80.0 | 2.00
Fruit, Potato Chips | Soda | 4 | 6 | 4 | 100.0 | 1.67
Peanut Butter | Milk | 4 | 6 | 4 | 100.0 | 1.67
Peanut Butter | Milk, Fruit | 4 | 6 | 4 | 100.0 | 1.67

Association Rules (Slide 5 of 7)

Table 5.5: Association Rules for Hy-Vee (cont.)

Antecedent (A) | Consequent (C) | Support for A | Support for C | Support for A & C | Confidence (%) | Lift Ratio
Peanut Butter, Fruit | Milk | 4 | 6 | 4 | 100.0 | 1.67
Potato Chips | Fruit, Soda | 4 | 6 | 4 | 100.0 | 1.67
Potato Chips | Soda | 4 | 6 | 4 | 100.0 | 1.67
Fruit, Soda | Potato Chips | 6 | 4 | 4 | 66.7 | 1.67
Milk | Peanut Butter | 6 | 4 | 4 | 66.7 | 1.67
Milk | Peanut Butter, Fruit | 6 | 4 | 4 | 66.7 | 1.67
Milk, Fruit | Peanut Butter | 6 | 4 | 4 | 66.7 | 1.67
Soda | Fruit, Potato Chips | 6 | 4 | 4 | 66.7 | 1.67

Association Rules (Slide 6 of 7)

Table 5.5: Association Rules for Hy-Vee (cont.)

Antecedent (A) | Consequent (C) | Support for A | Support for C | Support for A & C | Confidence (%) | Lift Ratio
Soda | Potato Chips | 6 | 4 | 4 | 66.7 | 1.67
Fruit, Soda | Milk | 6 | 6 | 5 | 83.3 | 1.39
Milk | Fruit, Soda | 6 | 6 | 5 | 83.3 | 1.39
Milk | Soda | 6 | 6 | 5 | 83.3 | 1.39
Milk, Fruit | Soda | 6 | 6 | 5 | 83.3 | 1.39
Soda | Milk | 6 | 6 | 5 | 83.3 | 1.39
Soda | Milk, Fruit | 6 | 6 | 5 | 83.3 | 1.39

Association Rules (Slide 7 of 7)

Evaluating Association Rules:

An association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets.

For example, Walmart mined its transactional data to uncover strong evidence of the association rule, “If a customer purchases a Barbie doll, then a customer also purchases a candy bar.”

An association rule is useful if it is well supported and explains an important previously unknown relationship.

The support of an association rule can generally be improved by basing it on less specific antecedent and consequent item sets.

Text Mining

Voice of the Customer at Triad Airline

Preprocessing Text Data for Analysis

Movie Reviews

Text Mining (1 of 12)

Text, like numerical data, may contain information that can help solve problems and lead to better decisions.

Text mining is the process of extracting useful information from text data.

Text data is often referred to as unstructured data because in its raw form, it cannot be stored in a traditional structured database (rows and columns).

Audio and video data are also examples of unstructured data.

Data mining with text data is more challenging than data mining with traditional numerical data, because it requires more preprocessing to convert the text to a format amenable for analysis.

Text Mining (2 of 12)

Voice of the Customer at Triad Airline:

Triad solicits feedback from its customers through a follow-up e-mail the day after the customer has completed a flight.

The survey asks the customer to rate various aspects of the flight and to type comments into a dialog box in the e-mail. The responses include:

Quantitative feedback from the ratings.

Comments entered by the respondents which need to be analyzed.

A collection of text documents to be analyzed is called a corpus.

Text Mining (3 of 12)

Table 5.6: Ten Respondents’ Concerns for Triad Airlines

Concerns

The wi-fi service was horrible. It was slow and cut off several times.

My seat was uncomfortable.

My flight was delayed 2 hours for no apparent reason.

My seat would not recline.

The man at the ticket counter was rude. Service was horrible.

The flight attendant was rude. Service was bad.

My flight was delayed with no explanation.

My drink spilled when the guy in front of me reclined his seat.

My flight was canceled.

The arm rest of my seat was nasty.

Text Mining (4 of 12)

Voice of the Customer at Triad Airline:

To be analyzed, text data needs to be converted to structured data (rows and columns of numerical data) so that the tools of descriptive statistics, data visualization and data mining can be applied.

Think of converting a group of documents into a matrix of rows and columns where each row corresponds to a document and each column corresponds to a particular word.

A presence/absence or binary term-document matrix is a matrix with the rows representing documents and the columns representing words.

Entries in the columns indicate either the presence or the absence of a particular word in a particular document.

Text Mining (5 of 12)

Voice of the Customer at Triad Airline (cont.):

Creating the list of terms to use in the presence/absence matrix can be a complicated matter:

Too many terms result in a matrix with many columns, which may be difficult to manage and could yield meaningless results.

Too few terms may miss important relationships.

Term frequency, along with the problem context, is often used as a guide.

In Triad’s case, management used word frequency and the context of having a goal of satisfied customers to come up with the following list of terms they feel are relevant for categorizing the respondents’ comments: delayed, flight, horrible, recline, rude, seat, and service.

Text Mining (6 of 12)

Table 5.7: The Presence/Absence Term-Document Matrix for Triad Airlines

Term

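Document | Delayed | Flight | Horrible | Recline | Rude | Seat | Service
1 | 0 | 0 | 1 | 0 | 0 | 0 | 1
2 | 0 | 0 | 0 | 0 | 0 | 1 | 0
3 | 1 | 1 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 1 | 0 | 1 | 0
5 | 0 | 0 | 1 | 0 | 1 | 0 | 1
6 | 0 | 1 | 0 | 0 | 1 | 0 | 1
7 | 1 | 1 | 0 | 0 | 0 | 0 | 0
8 | 0 | 0 | 0 | 1 | 0 | 1 | 0
9 | 0 | 1 | 0 | 0 | 0 | 0 | 0
10 | 0 | 0 | 0 | 0 | 0 | 1 | 0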
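The presence/absence matrix in Table 5.7 can be rebuilt from the ten comments in Table 5.6 with a short Python sketch. The prefix match used here (a token counts if it begins with the term, so "reclined" counts for "recline") is a crude stand-in for stemming that is assumed for this example only; the function names are illustrative.

```python
import re

# The ten comments from Table 5.6
comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    "My seat would not recline.",
    "The man at the ticket counter was rude. Service was horrible.",
    "The flight attendant was rude. Service was bad.",
    "My flight was delayed with no explanation.",
    "My drink spilled when the guy in front of me reclined his seat.",
    "My flight was canceled.",
    "The arm rest of my seat was nasty.",
]

# Terms chosen by Triad's management
terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def presence_row(text):
    """1 if any token begins with the term, else 0 (crude stand-in for stemming)."""
    tokens = tokenize(text)
    return [int(any(tok.startswith(term) for tok in tokens)) for term in terms]

matrix = [presence_row(comment) for comment in comments]
for doc_number, row in enumerate(matrix, start=1):
    print(doc_number, row)
```

Each printed row corresponds to one document and each position to one term, in the same order as the columns of Table 5.7.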

Text Mining (7 of 12)

Preprocessing Text Data for Analysis:

The text-mining process converts unstructured text into numerical data and applies quantitative techniques.

Which terms become the headers of the columns of the term-document matrix can greatly impact the analysis.

Tokenization is the process of dividing text into separate terms, referred to as tokens:

Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase.

Different forms of the same word, such as “stacking,” “stacked,” and “stack” probably should not be considered as distinct terms.

Stemming is the process of converting a word to its stem or root word.

Text Mining (8 of 12)

Preprocessing Text Data for Analysis (cont.):

The goal of preprocessing is to generate a list of most-relevant terms that is sufficiently small so as to lend itself to analysis:

Frequency can be used to eliminate words from consideration as tokens.

Low-frequency words probably will not be very useful as tokens.

Consolidating words that are synonyms can reduce the set of tokens.

Most text-mining software gives the user the ability to manually specify terms to include or exclude as tokens.

The use of slang, humor, and sarcasm can cause interpretation problems and might require more sophisticated data cleansing and subjective intervention on the part of the analyst to avoid misinterpretation.

Data preprocessing parses the original text data down to the set of tokens deemed relevant for the topic being studied.
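The preprocessing steps (tokenize, lowercase, stem, filter by frequency) can be sketched in Python. Everything below is an illustrative assumption: the suffix-stripping stemmer is deliberately naive (real stemmers handle many more cases), and the stop-word list and sample documents are made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop words excluded from the token list
STOP_WORDS = {"the", "were", "it"}

def tokenize(text):
    """Drop symbols and punctuation, convert to lowercase, split into tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def frequent_terms(documents, min_count=2):
    """Count stemmed tokens and keep those appearing at least min_count times."""
    counts = Counter(
        naive_stem(tok)
        for doc in documents
        for tok in tokenize(doc)
        if tok not in STOP_WORDS
    )
    return {term: n for term, n in counts.items() if n >= min_count}

# Hypothetical documents
documents = ["Stacking crates.", "The crates were stacked!", "Stack it."]
print(frequent_terms(documents))
```

Here "stacking," "stacked," and "stack" collapse into one token, and raising `min_count` drops the low-frequency terms.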

Business Analytics

Statistical Inference

Chapter 6

Introduction (Slide 1 of 2)

A census collects data from every element in the population of interest.

Many potential difficulties associated with taking a census; it may be:

Expensive.

Time consuming.

Misleading.

Unnecessary.

Impractical.

Introduction (Slide 2 of 2)

Statistical inference uses sample data to make estimates of or draw conclusions about one or more characteristics of a population.

The sampled population is the population from which the sample is drawn.

A frame is a list of elements from which the sample will be selected.

Selecting a Sample

Sampling from a Finite Population

Sampling from an Infinite Population

Selecting a Sample (Slide 1 of 4)

Parameter: A measurable factor that defines a characteristic of a population, process, or system.

Sampling from a Finite Population:

Statisticians recommend selecting a probability sample when sampling from a finite population because a probability sample allows you to make valid statistical inferences about the population.

Simple Random Sample (Finite Population):

A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected.
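As an alternative to a spreadsheet, a simple random sample can be drawn in Python. This is a minimal sketch: the frame of 2,500 employee identifiers is hypothetical, and the function name and seed are assumptions for the example.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n elements without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical frame listing every element of a finite population
frame = [f"employee_{i:04d}" for i in range(1, 2501)]

sample = simple_random_sample(frame, 30, seed=42)
print(sample[:5])
```

Fixing the seed makes the draw reproducible, which is convenient for teaching; omit it to get a different sample on each run.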

Selecting a Sample (Slide 2 of 4)

Figure 6.1: Using Excel to Select a Simple Random Sample

Selecting a Sample (Slide 3 of 4)

Sampling from an Infinite Population:

With an infinite population, you cannot select a simple random sample because you cannot construct a frame consisting of all the elements.

Statisticians recommend selecting what is called a random sample.

Random Sample (Infinite Population):

A random sample of size n from an infinite population is a sample selected such that the following conditions are satisfied:

Each element selected comes from the same population.

Each element is selected independently.

Selecting a Sample (Slide 4 of 4)

Care and judgment must be exercised in the selection process for a random sample from an infinite population:

Each element selected comes from the same population.

Each element is selected independently.

Situations involving sampling from an infinite population are usually associated with a process that operates over time.

Point Estimation

Practical Advice

Point Estimation (Slide 1 of 5)

To estimate the value of a population parameter, compute a corresponding characteristic of the sample—a sample statistic.

Using the data in Table 6.1:

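The sample mean is:

$$\bar{x} = \frac{\sum x_i}{n}$$

The sample proportion, where x is the number of sampled elements with the characteristic of interest, is:

$$\bar{p} = \frac{x}{n}$$

The sample standard deviation is:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$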
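The three point estimates can be computed with Python's standard library. The five salaries and training statuses below are hypothetical values for illustration only; they are not the Table 6.1 data.

```python
import statistics

# Hypothetical sample: annual salary ($) and training program status
salaries = [70000, 72000, 74000, 68000, 76000]
completed_training = ["Yes", "No", "Yes", "Yes", "No"]

x_bar = statistics.mean(salaries)    # sample mean
s = statistics.stdev(salaries)       # sample standard deviation (n - 1 divisor)
p_bar = completed_training.count("Yes") / len(completed_training)  # sample proportion

print(x_bar, round(s, 2), p_bar)
```

Note that `statistics.stdev` divides by n - 1, which matches the sample standard deviation, whereas `statistics.pstdev` divides by n and describes a whole population.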

Point Estimation (Slide 2 of 5)

Calculating sample mean, sample standard deviation, and sample proportion is called point estimation:

Point Estimation (Slide 3 of 5)

Table 6.1: Annual Salary and Training Program Status for a Simple Random Sample of 30 EAI Employees

Point Estimation (Slide 4 of 5)

Table 6.2: Summary of Point Estimates Obtained from a Simple Random Sample of 30 EAI Employees

The point estimates differ somewhat from the values of corresponding population parameters.

Point Estimation (Slide 5 of 5)

Practical Advice:

When making inferences, it is important to have a close correspondence between the sampled population and the target population:

Target population: Population about which we want to make inferences.

Sampled population: Population from which the sample is taken.

Good judgment is a necessary ingredient of sound statistical practice.

Sampling Distributions

Sampling Distribution of $\bar{x}$

Sampling Distribution of $\bar{p}$

Sampling Distributions (Slide 1 of 18)

A random variable is a quantity whose values are not known with certainty.

Sampling Distributions (Slide 2 of 18)

Sampling Distributions (Slide 3 of 18)

Mean Annual Salary ($) | Frequency | Relative Frequency
69,500.00–69,999.99 | 2 | 0.004
70,000.00–70,499.99 | 16 | 0.032
70,500.00–70,999.99 | 52 | 0.104
71,000.00–71,499.99 | 101 | 0.202
71,500.00–71,999.99 | 133 | 0.266
72,000.00–72,499.99 | 110 | 0.220
72,500.00–72,999.99 | 54 | 0.108
73,000.00–73,499.99 | 26 | 0.052
73,500.00–73,999.99 | 6 | 0.012
Totals | 500 | 1.000

Sampling Distributions (Slide 4 of 18)

Sampling Distributions (Slide 5 of 18)

Sampling Distributions (Slide 6 of 18)

Sampling distribution has:

An expected value or mean.

A standard deviation.

A characteristic shape or form.

Sampling Distributions (Slide 7 of 18)

When the expected value of a point estimator equals the population parameter, we say the point estimator is unbiased.

Sampling Distributions (Slide 8 of 18)

Sampling Distributions (Slide 9 of 18)

Finite population correction factor: $\sqrt{\dfrac{N - n}{N - 1}}$, where $N$ is the population size and $n$ is the sample size.

In many practical sampling situations, the finite population correction factor is close to 1, so the difference between the values of the standard deviation for the finite and infinite populations is negligible.
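
To see how small the correction usually is, the sketch below compares the standard deviation of the sample mean with and without the usual finite population correction factor, sqrt((N - n)/(N - 1)). The function name and the values of sigma, n, and N are assumptions chosen for illustration.

```python
import math

def standard_error(sigma, n, N=None):
    """Standard deviation of the sample mean.

    Applies the finite population correction only when a population size N
    is supplied; otherwise the population is treated as infinite.
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Hypothetical values: sigma = 4000, sample of 30 from a population of 2500.
infinite = standard_error(4000, 30)
finite = standard_error(4000, 30, N=2500)

print(f"infinite: {infinite:.2f}, finite: {finite:.2f}, ratio: {finite / infinite:.4f}")
```

With a sample this small relative to the population, the ratio stays very close to 1, which is why the correction is often ignored.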

Sampling Distributions (Slide 10 of 18)

Sampling Distributions (Slide 11 of 18)

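When the population has a normal distribution, the sampling distribution of $\bar{x}$ is normally distributed for any sample size.

When the population does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of $\bar{x}$.

Central limit theorem:

In selecting random samples of size n from a population, the sampling distribution of the sample mean $\bar{x}$ can be approximated by a normal distribution as the sample size becomes large.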

Sampling Distributions (Slide 12 of 18)

Figure 6.4: Illustration of the Central Limit Theorem for Three Populations

Top panel shows that none of the populations are normally distributed.

Bottom three panels show the shape of the sampling distribution for samples of size n = 2, n = 5, and n = 30.

General statistical practice is to assume that, for most applications, the sampling distribution of the sample mean can be approximated by a normal distribution whenever the sample size is 30 or more.
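
The central limit theorem can be illustrated with a small simulation. The sketch below is not Figure 6.4; it assumes an exponential population (mean 1, standard deviation 1), which is strongly right-skewed, and draws repeated samples of size 2, 5, and 30. The function name and the number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n, reps=2000):
    """Means of `reps` random samples of size n from an exponential
    population with mean 1 and standard deviation 1."""
    return [statistics.mean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

for n in (2, 5, 30):
    means = sample_means(n)
    # The mean of the sample means stays near 1, while their spread
    # shrinks roughly like 1 / sqrt(n).
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` for each n would show the skew fading as n grows, which is the behavior the rule of thumb of n of 30 or more relies on.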

Sampling Distributions (Slide 13 of 18)

Sampling Distributions (Slide 14 of 18)

Sampling Distributions (Slide 15 of 18)

The formula for computing the sample proportion is: $\bar{p} = \dfrac{x}{n}$

where

x = the number of elements in the sample that possess the characteristic of interest.

n = sample size.

Sampling Distributions (Slide 16 of 18)

Sampling Distributions (Slide 17 of 18)

Sampling Distributions (Slide 18 of 18)

Interval Estimation

Interval Estimation of the Population Mean

Interval Estimation of the Population Proportion

Interval Estimation (Slide 1 of 15)

Because a point estimator cannot be expected to provide the exact value of a population parameter, interval estimation is frequently used to generate an estimate of the value of a population parameter.

An interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate.

The general form of an interval estimate is: Point estimate ± Margin of error

Interval Estimation (Slide 2 of 15)

Interval Estimation of the Population Mean:

An interval estimate provides information about how close the point estimate is to the value of the population parameter.

General form of an interval estimate of a population mean is: $\bar{x} \pm$ Margin of error

General form of an interval estimate of a population proportion is: $\bar{p} \pm$ Margin of error

Interval Estimation (Slide 3 of 15)

Interval Estimation of the Population Mean (cont.):

For any normally distributed random variable:

90% of the values lie within 1.645 standard deviations of the mean.

95% of the values lie within 1.960 standard deviations of the mean.

99% of the values lie within 2.576 standard deviations of the mean.
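
These multipliers can be recovered from the standard normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist`; the loop and variable names are only illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    # Cutoff that leaves alpha / 2 in each tail of the distribution.
    multiplier = z.inv_cdf(1 - alpha / 2)
    print(f"{confidence:.0%} of values lie within {multiplier:.3f} standard deviations")
```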

Interval Estimation (Slide 4 of 15)

Figure 6.8: Sampling Distribution of the Sample Mean

Interval Estimation (Slide 5 of 15)

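When the population standard deviation $\sigma$ is unknown and the sample standard deviation $s$ is used to estimate it, the margin of error carries an additional source of uncertainty. If the sampling distribution follows a normal distribution, address this additional source of uncertainty by using a probability distribution known as the t distribution: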

A family of similar probability distributions.

The shape of each specific one depends on a parameter referred to as the degrees of freedom.

Similar in shape to the standard normal distribution, but wider.

As the degrees of freedom increase, the t distribution narrows, its peak becomes higher, and it becomes more similar to the standard normal distribution.
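
The narrowing can be checked numerically. The sketch below is one possible standard-library approach, not a reference implementation: it integrates the t density with Simpson's rule and inverts the result by bisection. The function names `t_pdf`, `t_cdf`, and `t_quantile` are hypothetical helpers; a statistics package would normally supply these.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, df):
    """Value t with P(T <= t) = p, by bisection (assumes 0.5 <= p < 1)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Upper 0.025 cutoffs: the t values shrink toward the normal value as df grows.
for df in (10, 20, 29, 1000):
    print(df, round(t_quantile(0.975, df), 3))
print("normal", round(NormalDist().inv_cdf(0.975), 3))
```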

Interval Estimation (Slide 6 of 15)

Figure 6.9: Comparison of the Standard Normal Distribution with t Distributions with 10 and 20 Degrees of Freedom

Interval Estimation (Slide 7 of 15)

Figure 6.10: t Distribution with 29 Degrees of Freedom

Interval Estimation (Slide 8 of 15)

Figure 6.11: Intervals Formed Around Sample Means from 10 Independent Random Samples

Interval Estimation (Slide 9 of 15)

Because approximately 90% of all the intervals constructed will contain the population mean, we say that we are approximately 90% confident that the interval will include the population mean:

Say that the interval has been established at the 90% confidence level.

The value of 0.90 is referred to as the confidence coefficient.

The interval is called the 90% confidence interval.

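The level of significance is the probability that the interval estimation procedure will generate an interval that does not contain the population mean: $\alpha = 1 - \text{confidence coefficient}$ (for example, $\alpha = 0.10$ for a 90% confidence interval).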

Interval Estimation (Slide 10 of 15)

Interval Estimation (Slide 11 of 15)

Table 6.5: Credit Card Balances for a Sample of 70 Households

9,430 14,661 7,159 9,071 9,691 11,032

7,535 12,195 8,137 3,603 11,448 6,525

4,078 10,544 9,467 16,804 8,279 5,239

5,604 13,659 12,595 13,479 5,649 6,195

5,179 7,061 7,917 14,044 11,298 12,584

4,416 6,245 11,346 6,817 4,353 15,415

10,676 13,021 12,806 6,845 3,467 15,917

1,627 9,719 4,972 10,493 6,191 12,591

10,112 2,200 11,356 615 12,851 9,743

6,567 10,746 7,117 13,627 5,337 10,324

13,627 12,744 9,465 12,557 8,372

18,719 5,742 19,263 6,232 7,445
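
A 95% confidence interval for the mean balance can be sketched from the Table 6.5 values using the usual form, sample mean plus or minus t times s / sqrt(n). One assumption is labeled in the code: the t multiplier 1.995 for 69 degrees of freedom is taken from a t table rather than computed.

```python
import math
import statistics

# Credit card balances from Table 6.5 (n = 70).
balances = [
    9430, 14661, 7159, 9071, 9691, 11032,
    7535, 12195, 8137, 3603, 11448, 6525,
    4078, 10544, 9467, 16804, 8279, 5239,
    5604, 13659, 12595, 13479, 5649, 6195,
    5179, 7061, 7917, 14044, 11298, 12584,
    4416, 6245, 11346, 6817, 4353, 15415,
    10676, 13021, 12806, 6845, 3467, 15917,
    1627, 9719, 4972, 10493, 6191, 12591,
    10112, 2200, 11356, 615, 12851, 9743,
    6567, 10746, 7117, 13627, 5337, 10324,
    13627, 12744, 9465, 12557, 8372,
    18719, 5742, 19263, 6232, 7445,
]

n = len(balances)
x_bar = statistics.mean(balances)   # point estimate of the population mean
s = statistics.stdev(balances)      # sample standard deviation

# Assumption: upper 0.025 cutoff of the t distribution with n - 1 = 69
# degrees of freedom, read from a t table.
t_crit = 1.995

margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(f"n = {n}, mean = {x_bar:.0f}, s = {s:.0f}, 95% CI = ({lower:.0f}, {upper:.0f})")
```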

Interval Estimation (Slide 12 of 15)

Figure 6.13: 95% Confidence Interval for Credit Card Balances

Interval Estimation (Slide 13 of 15)

Interval Estimation of the Population Proportion:

Interval Estimation (Slide 14 of 15)

Interval Estimation (Slide 15 of 15)

Figure 6.15: 95% Confidence Interval for Survey of Women Golfers
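
An interval estimate of a population proportion can be sketched with the common large-sample form, sample proportion plus or minus z times sqrt(p_bar(1 - p_bar)/n). The counts below are hypothetical survey results chosen for illustration, and `proportion_interval` is an assumed helper name.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, confidence=0.95):
    """Return (p_bar, margin) for the large-sample interval
    p_bar +/- z * sqrt(p_bar * (1 - p_bar) / n)."""
    p_bar = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar, margin

# Hypothetical survey: 396 of 900 respondents have the characteristic of interest.
p_bar, margin = proportion_interval(396, 900)
print(f"p_bar = {p_bar:.2f}, 95% CI = ({p_bar - margin:.4f}, {p_bar + margin:.4f})")
```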

Hypothesis Tests

Developing Null and Alternative Hypotheses

Type I and Type II Errors

Hypothesis Test of the Population Mean

Hypothesis Test of the Population Proportion

Hypothesis Tests (Slide 1 of 27)

In hypothesis testing, we begin with a tentative conjecture about a population parameter; this conjecture is called the null hypothesis (H0).

The opposite of what is stated in the null hypothesis is the alternative hypothesis (Ha).

The hypothesis testing procedure uses data from a sample to test the validity of the two competing statements about a population.

Hypothesis Tests (Slide 2 of 27)

Developing Null and Alternative Hypotheses:

Context of the situation is very important in determining how the hypotheses should be stated.

All hypothesis testing applications involve collecting a random sample and using the sample results to provide evidence for drawing a conclusion.

Ask:

What is the purpose of collecting the sample?

What conclusions are we hoping to make?

Hypothesis Tests (Slide 3 of 27)

Developing Null and Alternative Hypotheses (cont.):

Many applications of hypothesis testing involve an attempt to gather evidence in support of a research hypothesis; in these cases it is best to begin with the alternative hypothesis and make it the conclusion that the researcher hopes to support.

Not all hypothesis tests involve a research hypothesis:

Begin with a belief or a conjecture that a statement about the value of a population parameter is true.

Use a hypothesis test to challenge the conjecture and determine whether there is statistical evidence to conclude that the conjecture is incorrect.

Helpful to develop the null hypothesis first; the alternative hypothesis is that the belief or conjecture is incorrect.

Hypothesis Tests (Slide 4 of 27)

Developing Null and Alternative Hypotheses (cont.):

Depending upon the situation, hypothesis tests about a population parameter may take one of three forms:

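Two use inequalities in the null hypothesis: $H_0: \mu \ge \mu_0$ versus $H_a: \mu < \mu_0$, and $H_0: \mu \le \mu_0$ versus $H_a: \mu > \mu_0$.

Third one uses an equality in the null hypothesis: $H_0: \mu = \mu_0$ versus $H_a: \mu \ne \mu_0$ (shown here for a population mean $\mu$ with hypothesized value $\mu_0$).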

First two forms are called one-tailed tests.

Third form is called a two-tailed test.

Hypothesis Tests (Slide 5 of 27)

Type I and Type II Errors:

Table 6.6: Errors and Correct Conclusions in Hypothesis Testing

Conclusion | Population Condition: H0 True | Population Condition: Ha True
Do Not Reject H0 | Correct conclusion | Type II error
Reject H0 | Type I error | Correct conclusion

Hypothesis Tests (Slide 6 of 27)

Type I and Type II Errors (cont.):

Hypothesis Tests (Slide 7 of 27)

Level of Significance:

The level of significance is the probability of making a Type I error when the null hypothesis is true as an equality.

The person responsible for the hypothesis test specifies the level of significance and, by doing so, controls the probability of making a Type I error.

Applications of hypothesis testing that only control the Type I error are called significance tests.

Most applications of hypothesis testing control the probability of making a Type I error; they do not always control the probability of making a Type II error.

Hypothesis Tests (Slide 8 of 27)

Hypothesis Test of the Population Mean:

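One-tailed tests about a population mean take one of the following forms: lower-tail test, $H_0: \mu \ge \mu_0$ versus $H_a: \mu < \mu_0$; upper-tail test, $H_0: \mu \le \mu_0$ versus $H_a: \mu > \mu_0$.

Develop the null and alternative hypotheses for the test.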

Specify the level of significance for the test.

Collect the sample data and compute the value of what is called a test statistic.

Hypothesis Tests (Slide 9 of 27)

Hypothesis Tests (Slide 10 of 27)

Hypothesis Tests (Slide 11 of 27)

Hypothesis Tests (Slide 12 of 27)

The key question for a lower-tail test is, How small must the test statistic t be before we choose to reject the null hypothesis?

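p Value: The probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one computed from the sample. Smaller p values indicate stronger evidence against the null hypothesis.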

Hypothesis Tests (Slide 13 of 27)

Figure 6.18: Hypothesis Test about a Population Mean

Hypothesis Tests (Slide 14 of 27)

Hypothesis Tests (Slide 15 of 27)

Different decision makers may express different opinions concerning the cost of making a Type I error and may choose a different level of significance.

Providing the p value as part of the hypothesis testing results allows each decision maker to compare the reported p value to his or her own level of significance.

The level of significance indicates the strength of evidence that is needed in the sample data before rejection of the null hypothesis.

Hypothesis Tests (Slide 16 of 27)

For an upper-tail test, the p value is the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample.

Computation of p Values for One-Tailed Tests:

1. Compute the value of the test statistic using equation (6.11).

2. Lower-tail test: Using the t distribution, compute the probability that t is less than or equal to the value of the test statistic (area in the lower tail).

3. Upper-tail test: Using the t distribution, compute the probability that t is greater than or equal to the value of the test statistic (area in the upper tail).

Hypothesis Tests (Slide 17 of 27)

In hypothesis testing, the general form for a two-tailed test about a population mean is: $H_0: \mu = \mu_0$ versus $H_a: \mu \ne \mu_0$

Hypothesis Tests (Slide 18 of 27)

Figure 6.20: p Value for the Holiday Toys Two-Tailed Hypothesis Test

Hypothesis Tests (Slide 19 of 27)

Figure 6.21: Two-Tailed Hypothesis Test for Holiday Toys

Hypothesis Tests (Slide 20 of 27)

Computation of p Values for Two-Tailed Tests:

1. Compute the value of the test statistic using equation (6.11).

2. If the value of the test statistic is in the upper tail, compute the probability that t is greater than or equal to the value of the test statistic (the upper-tail area). If the value of the test statistic is in the lower tail, compute the probability that t is less than or equal to the value of the test statistic (the lower-tail area).

3. Double the probability (or tail area) from step 2 to obtain the p value.

Hypothesis Tests (Slide 21 of 27)

Table 6.7: Summary of Hypothesis Tests About a Population Mean

Hypothesis Tests (Slide 22 of 27)

Steps of Hypothesis Testing:

Step 1. Develop the null and alternative hypotheses.

Step 2. Specify the level of significance.

Step 3. Collect the sample data and compute the value of the test statistic.

Step 4. Use the value of the test statistic to compute the p value.

Step 5. Reject H0 if the p value is less than or equal to the level of significance.

Step 6. Interpret the statistical conclusion in the context of the application.
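
The steps above can be sketched for a test about a population mean. This is a minimal standard-library illustration, assuming the usual test statistic t = (x_bar - mu0) / (s / sqrt(n)) with n - 1 degrees of freedom; the helper names and the sample summary values are hypothetical, and the t probabilities come from a simple numerical integration rather than a statistics package.

```python
import math

def t_cdf(x, df, steps=2000):
    """P(T <= x) for the t distribution, via Simpson's rule and symmetry."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    pdf = lambda u: math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))
    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def p_value(t_stat, df, tail):
    """p value for a 'lower', 'upper', or 'two' tailed test."""
    lower = t_cdf(t_stat, df)
    if tail == "lower":
        return lower
    if tail == "upper":
        return 1 - lower
    if tail == "two":
        return 2 * min(lower, 1 - lower)  # double the relevant tail area
    raise ValueError("tail must be 'lower', 'upper', or 'two'")

# Step 1: H0: mu >= mu0 versus Ha: mu < mu0 (a lower-tail test).
mu0 = 3.0
# Step 2: level of significance.
alpha = 0.01
# Step 3: hypothetical sample summary and test statistic.
n, x_bar, s = 36, 2.92, 0.17
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
# Step 4: p value from the t distribution with n - 1 degrees of freedom.
p = p_value(t_stat, n - 1, "lower")
# Step 5: reject H0 if the p value is at most alpha.
reject = p <= alpha
print(f"t = {t_stat:.3f}, p value = {p:.4f}, reject H0: {reject}")
```

Step 6, interpreting the conclusion in context, is left to the analyst; the code only reports the statistical decision.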

Hypothesis Tests (Slide 23 of 27)

Hypothesis Tests (Slide 24 of 27)

Hypothesis Test of the Population Proportion:

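The three forms for a hypothesis test about a population proportion are: $H_0: p \ge p_0$ versus $H_a: p < p_0$; $H_0: p \le p_0$ versus $H_a: p > p_0$; and $H_0: p = p_0$ versus $H_a: p \ne p_0$.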

The first form is called a lower-tail test.

The second form is called an upper-tail test.

The third form is called a two-tailed test.
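
A test about a population proportion can be sketched the same way for all three forms. The code assumes the common large-sample test statistic z = (p_bar - p0) / sqrt(p0(1 - p0)/n) with a standard normal reference distribution; the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_test(x, n, p0, tail):
    """Return (z, p value) for a 'lower', 'upper', or 'two' tailed test
    of a population proportion against the hypothesized value p0."""
    p_bar = x / n
    z = (p_bar - p0) / math.sqrt(p0 * (1 - p0) / n)
    lower = NormalDist().cdf(z)
    if tail == "lower":
        p = lower
    elif tail == "upper":
        p = 1 - lower
    elif tail == "two":
        p = 2 * min(lower, 1 - lower)
    else:
        raise ValueError("tail must be 'lower', 'upper', or 'two'")
    return z, p

# Hypothetical upper-tail test: H0: p <= 0.20 versus Ha: p > 0.20,
# with 100 of 400 sampled elements having the characteristic of interest.
z, p = proportion_test(100, 400, 0.20, "upper")
print(f"z = {z:.2f}, p value = {p:.4f}")
```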

Hypothesis Tests (Slide 25 of 27)

Figure 6.22: Calculation of the p Value for the Pine Creek Hypothesis Test

Hypothesis Tests (Slide 26 of 27)

Figure 6.23: Hypothesis Test for Pine Creek Golf Course

Hypothesis Tests (Slide 27 of 27)

Table 6.8: Summary of Hypothesis Tests About a Population Proportion

Big Data, Statistical Inference, and Practical Significance

Sampling Error

Nonsampling Error

Big Data

Understanding What Big Data Is

Big Data and Sampling Error

Big Data and the Precision of Confidence …
