Bivariate Data

A univariate data set consists of only one variable, such as the incomes of individual families, the heights of children in a given age group, test scores, or the ages of employees in an organization. Many situations, however, require observing two variables, for example when we want to know whether the variable under study is related to another variable. The study of bivariate data provides the tools, techniques and methods for analyzing such distributions and drawing inferences from them.

A contingency table is used to display the bivariate data when both the variables are classified as categorical.

Bivariate data consists of two variables, whose relationship is to be analyzed.

The variables in a bivariate data distribution can both be numerical, can both be categorical, or can be one numerical and one categorical. If the analysis shows that one variable is influenced by the other, the influenced variable is called the dependent variable and the other is called the independent variable.

The techniques applied in the analysis of bivariate data depend on the types of variables involved in the distribution.

Scatter Plot and Regression Line

When both variables in a bivariate data set are quantitative (numerical), a scatter plot is used to study the relationship between them. Each pair of values is treated as an ordered pair and plotted on a graph. The independent variable is measured along the horizontal X-axis and the dependent variable along the vertical Y-axis. From the pattern of the plotted points, we can analyze the correlation between the two variables.

The above scatter plot shows the relationship between the average number of hours studied per week and the final score. A positive correlation can be recognized from the pattern. From the data set, a regression line (or trend line) can be found using various methods. The equation of the regression line is useful in forecasting future behavior.
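The text does not say which method it has in mind. One widely used choice is the least-squares method, given here as an assumption rather than as the method behind the plot. For n data pairs (x_i, y_i) with means x-bar and y-bar, the slope b, the intercept a and the correlation coefficient r are:

```latex
\[
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x},
\qquad
\hat{y} = a + b\,x
\]
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```

The sign of b always matches the sign of r, so a rising trend line goes with a positive correlation.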

Numerical Variable and a Categorical Variable

A back-to-back stem plot or a histogram is used to display bivariate data consisting of a numerical variable and a categorical variable with two categories. The following table shows the weights of newborn babies in a hypothetical hospital during the course of a month.

Weights in kg

| Boys | 3.5 | 4.3 | 5.0 | 3.6 | 4.9 | 3.5 | 3.8 | 4.8 | 3.6 | 4.2 |
| Girls | 3.0 | 2.8 | 3.8 | 3.2 | 4.1 | 3.1 | 2.7 | 3.3 | 3.6 | 3.2 |

The back-to-back stem plot shown above can be used for further analysis, such as finding the median and the quartiles of each group.
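Because the plot depends only on the weights in the table, it can be regenerated as text. The following is a minimal Python sketch; the function names `split_stem_leaf` and `back_to_back_stem_plot` are illustrative choices, and the whole-kilogram stem with a tenths-of-a-kilogram leaf is an assumed layout.

```python
# Weights in kg, copied from the table of newborn weights.
boys = [3.5, 4.3, 5.0, 3.6, 4.9, 3.5, 3.8, 4.8, 3.6, 4.2]
girls = [3.0, 2.8, 3.8, 3.2, 4.1, 3.1, 2.7, 3.3, 3.6, 3.2]


def split_stem_leaf(value):
    """Split a weight such as 3.6 into stem 3 and leaf 6 (tenths of a kg)."""
    tenths = round(value * 10)  # round() guards against float artifacts
    return divmod(tenths, 10)


def back_to_back_stem_plot(left, right):
    """Return the rows of a back-to-back stem plot as a list of strings."""
    left_leaves, right_leaves = {}, {}
    for data, leaves in ((left, left_leaves), (right, right_leaves)):
        for value in data:
            stem, leaf = split_stem_leaf(value)
            leaves.setdefault(stem, []).append(leaf)

    lowest = min(min(left_leaves), min(right_leaves))
    highest = max(max(left_leaves), max(right_leaves))
    rows = []
    for stem in range(lowest, highest + 1):
        # Left-hand leaves are read outward from the stem, so sort descending.
        left_sorted = sorted(left_leaves.get(stem, []), reverse=True)
        right_sorted = sorted(right_leaves.get(stem, []))
        left_text = " ".join(str(leaf) for leaf in left_sorted)
        right_text = " ".join(str(leaf) for leaf in right_sorted)
        rows.append(f"{left_text:>12} | {stem} | {right_text}")
    return rows


plot_rows = back_to_back_stem_plot(boys, girls)
print("        Boys |   | Girls")
print("\n".join(plot_rows))
```

Each row is read from the stem outward. The row `8 6 6 5 5 | 3 | 0 1 2 2 3 6 8` lists boys weighing 3.5, 3.5, 3.6, 3.6 and 3.8 kg on the left and girls weighing 3.0 to 3.8 kg on the right.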

When the categorical variable has more than two categories, parallel box plots can be constructed, each displaying the five-point summary (minimum, first quartile, median, third quartile and maximum) of one category.
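The summary behind each box can be computed directly from the data. The sketch below uses the two groups from the newborn weights table, since those are the only grouped data given here; the same function applies to any number of groups. The helper name `five_number_summary` is illustrative, and the quartile convention (medians of the lower and upper halves) is an assumption, since textbooks and software differ on this point.

```python
from statistics import median

# Weights in kg, copied from the table of newborn weights.
boys = [3.5, 4.3, 5.0, 3.6, 4.9, 3.5, 3.8, 4.8, 3.6, 4.2]
girls = [3.0, 2.8, 3.8, 3.2, 4.1, 3.1, 2.7, 3.3, 3.6, 3.2]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for at least two values.

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data; with an odd count the middle value is left
    out of both halves. Other conventions can give slightly different
    quartiles.
    """
    values = sorted(data)
    half = len(values) // 2
    lower, upper = values[:half], values[-half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


summaries = {
    "Boys": five_number_summary(boys),
    "Girls": five_number_summary(girls),
}
for group, summary in summaries.items():
    print(group, summary)
```

Under this convention the boys' summary is 3.5, 3.6, 4.0, 4.8, 5.0 kg and the girls' summary is 2.7, 3.0, 3.2, 3.6, 4.1 kg, so every point of the boys' summary lies above the corresponding point for the girls.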

Example:

The following contingency table shows the ice cream flavor preferences of male and female students. Both variables, flavor and gender, are categorical.

| Flavor | Male | Female | Total |
|---|---|---|---|
| Vanilla | 9 | 5 | 14 |
| Chocolate | 12 | 20 | 32 |
| Strawberry | 12 | 15 | 27 |
| Caramel | 15 | 12 | 27 |
| Banana Split | 12 | 8 | 20 |
| Total | 60 | 60 | 120 |

This contingency table can be used to analyze the bivariate data with different techniques. The frequencies can be expressed as percentages and compared, or the table can be used to test a claim about population behavior with more advanced techniques such as hypothesis testing.
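Both uses can be sketched in a few lines of Python. The percentages follow directly from the table. For the hypothesis test, the text does not name a specific test, so the chi-square test of independence used here is an assumption (it is one common choice for two categorical variables). Variable names are illustrative.

```python
# Observed counts, copied from the contingency table.
flavors = ["Vanilla", "Chocolate", "Strawberry", "Caramel", "Banana Split"]
male = [9, 12, 12, 15, 12]
female = [5, 20, 15, 12, 8]

male_total, female_total = sum(male), sum(female)
grand_total = male_total + female_total

# Column percentages: the share of each gender that prefers each flavor.
male_pct = [100 * count / male_total for count in male]
female_pct = [100 * count / female_total for count in female]

# Chi-square statistic: sum of (observed - expected)^2 / expected, where the
# expected count of a cell is (row total * column total) / grand total.
chi_square = 0.0
for m, f in zip(male, female):
    row_total = m + f
    for observed, column_total in ((m, male_total), (f, female_total)):
        expected = row_total * column_total / grand_total
        chi_square += (observed - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1) for 5 flavors and 2 genders.
degrees_of_freedom = (len(flavors) - 1) * (2 - 1)

for flavor, mp, fp in zip(flavors, male_pct, female_pct):
    print(f"{flavor:<13} male {mp:5.1f}%  female {fp:5.1f}%")
print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
```

The statistic comes to about 4.61 with 4 degrees of freedom. That is below 9.488, the 5% critical value of the chi-square distribution with 4 degrees of freedom, so these counts alone would not lead to rejecting the hypothesis that flavor preference is independent of gender.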

Solved Examples

Question 1: The table below shows the heights of players and the average number of points each scored in a single basketball match.

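| Height in cm (x) | Average points scored (y) | Height in cm (x) | Average points scored (y) | Height in cm (x) | Average points scored (y) |
|---|---|---|---|---|---|
| 184 | 12 | 200 | 20 | 199 | 18 |
| 194 | 22 | 188 | 18 | 177 | 6 |
| 185 | 6 | 184 | 14 | 184 | 16 |
| 174 | 5 | 188 | 12 | 178 | 8 |
| 186 | 14 | 182 | 14 | 190 | 20 |
| 183 | 10 | 185 | 10 | 193 | 24 |
| 175 | 8 | 183 | 18 | 204 | 24 |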

Use technology to draw a scatter plot of the given data and discuss the correlation between height and average points scored. Also use technology to find the line of best fit for the plotted bivariate data.

Solution:

From the pattern of the scatter plot, a positive correlation between the height of a player and the points scored can be inferred. The equation of the regression line is y = 0.611x - 99.699. The correlation coefficient is r = 0.82, which indicates a moderately strong positive correlation between the two variables.
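The quoted values can be checked without a graphing tool. The sketch below assumes the line of best fit is the least-squares line, which the solution does not state explicitly, and uses plain Python with illustrative variable names. The data pairs are read across the rows of the table.

```python
from math import sqrt

# (height in cm, average points scored), read across the rows of the table.
data = [
    (184, 12), (200, 20), (199, 18),
    (194, 22), (188, 18), (177, 6),
    (185, 6), (184, 14), (184, 16),
    (174, 5), (188, 12), (178, 8),
    (186, 14), (182, 14), (190, 20),
    (183, 10), (185, 10), (193, 24),
    (175, 8), (183, 18), (204, 24),
]
heights = [x for x, _ in data]
points = [y for _, y in data]

n = len(data)
mean_x = sum(heights) / n
mean_y = sum(points) / n

# Sums of squared deviations and of cross products about the means.
s_xx = sum((x - mean_x) ** 2 for x in heights)
s_yy = sum((y - mean_y) ** 2 for y in points)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)

slope = s_xy / s_xx                     # least-squares slope
intercept = mean_y - slope * mean_x     # line passes through (mean_x, mean_y)
r = s_xy / sqrt(s_xx * s_yy)            # Pearson correlation coefficient

print(f"y = {slope:.3f}x + ({intercept:.3f}),  r = {r:.2f}")
```

The slope and correlation coefficient agree with the quoted 0.611 and 0.82, and the intercept agrees with the quoted value to within rounding in the last digit.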

Question 2: The heights (in cm) of students in three grades in a high school are given below. Find the five-point summary for each group and plot the summaries as parallel box plots.