Statistical Analysis of House Prices and Living Areas: T-Tests and ANOVA, Lab Reports of Applied Statistics

This report presents the results of various statistical tests performed on house price and living area data. The tests include one-sample t-tests, two-sample t-tests, and analysis of variance (ANOVA). The data is presented for two different sets of variables: livarea and sprice. The tests aim to determine if the means of these variables are significantly different from a hypothesized value or if there is a significant difference between the means of two groups.

Typology: Lab Reports

2021/2022

Uploaded on 12/22/2023

hung-nguyen-djuc-1 🇻🇳


Contents

Activity 1

  • Introduction
  • I. Analysis qualitative
  • II. Analysis quantitative
  • III. Quantitative and Qualitative
  • IV. Hypothesis Testing
  • V. Prediction model
  • VI. Solving heteroskedasticity
  • Conclusion for the final model

Activity 2

  • Introduction
  • I. Clean dataset
  • II. Preprocess outliers
  • III. Descriptive Statistics
  • IV. Testing hypothesis
  • V. Prediction model
  • VI. Recommend different model
  • VII. Explain the problems
  • Conclusion

Activity 1

Introduction

This dataset contains 1500 houses sold in Stockton, California, during 1996-1998. The purpose of the dataset is to examine how the sale price of houses in Stockton, California, is affected by house characteristics. There are 7 variables:

  • Quantitative: sprice, livarea, beds, baths, age
  • Qualitative: lgelot, pool

Variable   Definition
sprice     Selling price of home, in dollars
livarea    Living area, in hundreds of square feet
beds       Number of beds
baths      Number of baths
lgelot     1 if lot size is greater than 0.5 acres, 0 otherwise
age        Age of home at time of sale, in years
pool       1 if home has a pool, 0 otherwise

Data source: Dr. John Knight, Department of Finance, University of the Pacific.

I. Analysis qualitative

a) pool:

category   no pool   pool
count      1402      98

Comment: Most houses have no pool (1402), while only 98 houses in the collected data have one, so the two groups differ greatly in size. The proportion of houses with a pool is 98/1500 ≈ 6.5%, which is below 7%.

b) lgelot:

category <=0.

Comment: The proportion of houses with a lot larger than 0.5 acres is smaller among houses without a pool than among houses with a pool.

II. Analysis quantitative

a) sprice

[Figure: Histogram of sprice. x-axis: sprice (dollars); y-axis: Frequency]

Comment: The distribution is positively skewed. The most frequent selling price is around 100 thousand dollars, and most selling prices fall between 50,000 and 140,000 dollars. The average selling price is about 100,000 dollars.

[1] "Percent of outliers: 0.0673333333333333"

sd = 63250.

Comment: The data has about 101 outliers, which account for 6.7% of the dataset. The data does not have large fluctuations. The highest price is a little over 700,000 dollars and the lowest is about 22 thousand dollars. The standard deviation of the price is about 63,000 dollars.

b) age

[Figure: Histogram of age. x-axis: age (years); y-axis: Frequency]

Comment: Houses about 18 years old are the most numerous. House ages range from newly built to about 50 years, with most houses between 10 and 22 years old. Houses older than 60 years are rare. The average age of a house is around 18 years.

[1] "Percent of outliers: 0.00333333333333333"

[1] "Percent of outliers: 0.0366666666666667"

sd = 5.

## [1] 55

Comment: The number of outliers is small relative to the dataset. The data fluctuates, but not greatly. The largest living area is close to 50 hundred square feet and the smallest is around 5 hundred square feet. The standard deviation of the living area is about 5 hundred square feet.

d) sprice and livarea

[Figure: Plot of sprice (dollars) against livarea (hundreds of ft^2)]

## [1] 0.

Comment: Based on the graph, the larger the living area of a house, the higher its selling price. The correlation coefficient is nearly 0.8.

e) livarea and baths

## [1] 0.

Comment: As above, the graph shows that the more baths a house has, the larger its living area. The correlation between baths and livarea is about 0.72.

f) livarea and beds

[Figure: Boxplot of livarea (hundreds of ft^2) by beds]

## [1] 0.

Comment: Turning to beds, the boxplot shows that the more bedrooms a house has, the larger its living area. On average, a house with 2 bedrooms has a smaller living area than a house with 3 bedrooms. The correlation is moderate, about 0.58.

III. Quantitative and Qualitative

a) livarea and lgelot

[Figure: Plot of livarea (hundreds of ft^2) by lgelot]

IV. Hypothesis Testing

I. Qualitative

a) The proportion of the houses with pool in the dataset is not more than 7%.

p is the proportion of the houses with pool in the dataset.

H0: p = 7%

Ha: p < 7%

1-sample proportions test without continuity correction

data: bangpool[2] out of length(pool), null probability 0.

X-squared = 0.50179, df = 1, p-value = 0.

alternative hypothesis: true p is less than 0.

95 percent confidence interval:

0.00000000 0.

sample estimates:

p

0.

Because p-value = 0.2394 > 0.05, we fail to reject H0. Therefore, at significance level α = 5%, we cannot conclude that the proportion of houses with a pool in the dataset is less than 7%.
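The test statistic above can be checked by hand. The following is a minimal Python sketch, not the R code used in the report; the function name one_sample_prop_test is illustrative, and it assumes the counts from the pool table (98 houses with a pool out of 1500) and a large-sample z-test without continuity correction.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_prop_test(successes, n, p0):
    """Large-sample z-test of H0: p = p0 against Ha: p < p0 (no continuity correction)."""
    p_hat = successes / n
    # The standard error is computed under the null hypothesis.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = NormalDist().cdf(z)  # lower-tail probability for Ha: p < p0
    return p_hat, z * z, p_value


# 98 houses with a pool out of 1500, tested against p0 = 7%.
p_hat, chi_sq, p_value = one_sample_prop_test(98, 1500, 0.07)
print(f"p_hat = {p_hat:.4f}, X-squared = {chi_sq:.5f}, p-value = {p_value:.4f}")
```

The squared z statistic plays the role of the X-squared value in the test output, and the lower-tail normal probability plays the role of the one-sided p-value.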

b) The proportion of the houses with size larger than 0.5 acres is bigger than 6%.

p is the proportion of the houses with size larger than 0.5 acres

H0: p = 6%

Ha: p > 6%

1-sample proportions test without continuity correction

data: banglge[2] out of length(lgelot), null probability 0.

X-squared = 0.29551, df = 1, p-value = 0.

alternative hypothesis: true p is greater than 0.

95 percent confidence interval:

0.05375494 1.

sample estimates:

p

0.

Because p-value = 0.2934 > 0.05, we fail to reject H0. Therefore, at significance level α = 5%, we cannot conclude that the proportion of houses with a lot larger than 0.5 acres is greater than 6%.

c) The proportion of houses with a lot larger than 0.5 acres is smaller among houses without a pool than among houses with a pool.

p1 is the proportion of houses with a lot larger than 0.5 acres among houses without a pool

p2 is the proportion of houses with a lot larger than 0.5 acres among houses with a pool

H0: p1 = p2

Ha: p1 < p2

2-sample test for equality of proportions without continuity correction

data: c(tbl[3], tbl[4]) out of c(length(pool[pool == "0"]), length(pool[pool == "1"]))

X-squared = 45.904, df = 1, p-value = 0.

alternative hypothesis: less

95 percent confidence interval:

-1.0000000 -0.

sample estimates:

prop 1 prop 2

0.05206847 0.

Because p-value < 0.05, we reject H0. Therefore, at significance level α = 5%, we conclude that the proportion of houses with a lot larger than 0.5 acres is smaller among houses without a pool than among houses with a pool.

II. Quantitative

a) The average selling price is about 100,000 dollars

H0: μ = 100000

Ha: μ ≠ 100000

One Sample t-test

data: sprice

t = 14.508, df = 1499, p-value < 0.

alternative hypothesis: true mean is not equal to 100000

95 percent confidence interval:

120490.4 126897.

sample estimates:

mean of x

123693.

Because p-value < 0.05, we reject H0. Therefore, at significance level α = 5%, we conclude that the average selling price is not 100,000 dollars.
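The t statistic of this test follows directly from the summary statistics. Below is a minimal Python sketch, not the report's R code. It assumes the rounded values mean ≈ 123693 and sd ≈ 63250 (the decimals are cut off in the output), and it approximates the p-value with the normal distribution, which is reasonable only because df = 1499 is large. The function name one_sample_t is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_t(mean, sd, n, mu0):
    """Return the t statistic and a normal-approximation two-sided p-value."""
    se = sd / sqrt(n)  # standard error of the sample mean
    t = (mean - mu0) / se
    # With df = n - 1 = 1499 the t distribution is very close to the normal,
    # so the normal CDF is used here as an approximation (an assumption).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_approx


# Rounded summary statistics from the report: mean ~ 123693, sd ~ 63250, n = 1500.
t_stat, p_approx = one_sample_t(123693, 63250, 1500, 100000)
print(f"t = {t_stat:.3f}, approximate p-value = {p_approx:.3g}")
```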

Ha: μ1 ≠ μ2

Welch Two Sample t-test

data: age[pool == "1"] and age[pool == "0"]

t = 0.57344, df = 109.82, p-value = 0.

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-2.003482 3.

sample estimates:

mean of x mean of y

22.62245 21.

Because p-value = 0.5675 > 0.05, we fail to reject H0. Therefore, at significance level α = 5%, there is no evidence that the average age of houses with swimming pools differs from the average age of houses without swimming pools.

e) The houses with a size >0.5 acres have a higher average selling price compared to houses with a size <=0.5 acres.

μ1: the average selling price of houses with a size >0.5 acres

μ2: the average selling price of houses with a size <=0.5 acres

H0: μ1 = μ2

Ha: μ1 > μ2

Welch Two Sample t-test

data: sprice[lgelot == "1"] and sprice[lgelot == "0"]

t = 10.206, df = 95.619, p-value < 0.

alternative hypothesis: true difference in means is greater than 0

95 percent confidence interval:

112023.6 Inf

sample estimates:

mean of x mean of y

249017.4 115220.

Because p-value < 2.2e-16 < 0.05, we reject H0. Therefore, at significance level α = 5%, we conclude that houses with a size >0.5 acres have a higher average selling price than houses with a size <=0.5 acres.

f) Houses with a pool have a higher average selling price compared to houses without a pool.

μ1: the average selling price of houses with a pool

μ2: the average selling price of houses without a pool

H0: μ1 = μ2

Ha: μ1 > μ2

Welch Two Sample t-test

data: sprice[pool == "1"] and sprice[pool == "0"]

t = 5.9191, df = 100.48, p-value = 0.

alternative hypothesis: true difference in means is greater than 0

95 percent confidence interval:

48184.14 Inf

sample estimates:

mean of x mean of y

186285.4 119318.

Because p-value = 2.262e-08 < 0.05, we reject H0. Therefore, at significance level α = 5%, we conclude that houses with a pool have a higher average selling price than houses without a pool.
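The non-integer degrees of freedom in the Welch tests above (for example df = 100.48) come from the Welch-Satterthwaite approximation, which does not assume equal variances in the two groups. The sketch below is a minimal Python illustration of that formula. The two samples are hypothetical toy values, not the Stockton data, and the function name welch_t is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean (sample variance / group size).
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical toy samples, not values from the Stockton data.
group_a = [3, 5, 7]
group_b = [1, 2, 3, 4, 5]
t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {df:.4f}")
```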

V. Prediction model

a) Eliminate outlier from dataset

As mentioned above, sprice has 101 outliers, accounting for 6.7% of the dataset. We remove them.

[1] "Length of outliers: "

[1] 0.

[1] "dimension of current data: "

[1] 1399 7

After removing the outliers, some new outliers appear, accounting for 1.4% of the data, which is not significant. Therefore, we can ignore them.
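New outliers can appear after a first pass because the quartiles shift once extreme values are removed. The report does not state which rule it uses, so the sketch below assumes the common boxplot rule (values beyond 1.5 × IQR from the quartiles). It is a minimal Python illustration on hypothetical toy data, and the function name iqr_outliers is illustrative.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the sample points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical toy prices (in thousands of dollars), not the Stockton data.
prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(prices)
cleaned = [v for v in prices if v not in outliers]
print(outliers, len(cleaned))
```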

b) Split training and validation sets

[1] "train set:"
     sprice livarea beds baths lgelot age pool
1419  83000      15    3   2.5      0  18    0
496   93000      14    3   2.0      0  33    0
726   98000      20    4   3.0      0  22    0
228   99950      14    3   2.0      0  38    0
650   87500      12    3   1.5      0  36    0
1088  85000      12    3   2.0      0  16    0

[1] "validate set:"
   sprice livarea beds baths lgelot age pool
1  138000      17    3   2.0      1  97    0
2  105700      21    4   2.5      0  18    0
9  125000      14    3   2.0      0   3    0
12 160000      19    4   2.5      0   4    0
13 151000      17    3   2.5      0   0    0
14 166000      19    4   2.0      0   0    0

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.079 on 976 degrees of freedom

Multiple R-squared: 0.7402, Adjusted R-squared: 0.

F-statistic: 927.1 on 3 and 976 DF, p-value: < 0.

Analysis of Variance Table

Response: livarea

           Df Sum Sq Mean Sq F value                Pr(>F)
sprice      1 8390.2  8390.2 1941.90 < 0.00000000000000022 ***
beds        1 2329.6  2329.6  539.18 < 0.00000000000000022 ***
baths       1 1297.0  1297.0  300.18 < 0.00000000000000022 ***
Residuals 976 4216.9      4.

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because p-value < 2.2e-16, we reject H0. This is a highly significant result: the null hypothesis is rejected at any reasonable significance level, so at least one variable helps explain livarea. Moreover, at significance level α = 5%, sprice, beds and baths are each significant in explaining livarea in this model. In particular, the p-value of beds is smaller than 0.05, so we reject H0 for beds and can use it to explain livarea.

Moreover, R^2 = 74.02% means that 74.02% of the observed variation in livarea can be explained by the linear regression on sprice, beds and baths.
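Both R^2 and the overall F statistic can be recovered from the sums of squares in the ANOVA table. The following Python sketch (not the report's R code; variable names are illustrative) uses only the values printed in that table.

```python
# Sums of squares from the ANOVA table of the report (response: livarea).
ss_terms = {"sprice": 8390.2, "beds": 2329.6, "baths": 1297.0}
ss_residual = 4216.9
df_model, df_residual = 3, 976

ss_model = sum(ss_terms.values())   # variation explained by the three predictors
ss_total = ss_model + ss_residual   # total variation in livarea

r_squared = ss_model / ss_total
f_stat = (ss_model / df_model) / (ss_residual / df_residual)
print(f"R^2 = {r_squared:.4f}, F = {f_stat:.1f}")
```

The results agree with the reported Multiple R-squared of 0.7402 and the F-statistic of 927.1 on 3 and 976 degrees of freedom.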

This gives the model: Ŷ = −2.797 + 5.724e−05 sprice + 1.684 beds + 3.284 baths. Holding the other variables fixed, the model shows that a 1-dollar increase in selling price is associated with an increase of 5.724e−05 hundred square feet in living area; one additional bed with an increase of 1.684 hundred square feet; and one additional bath with an increase of 3.284 hundred square feet.
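As a worked example of the fitted equation, the Python sketch below plugs one row of the validation set into the reported coefficients. The function name predict_livarea is illustrative, and the coefficients are the rounded values quoted in the text, so the result is approximate.

```python
# Coefficients as reported for the fitted model (livarea in hundreds of square feet).
COEF = {"intercept": -2.797, "sprice": 5.724e-05, "beds": 1.684, "baths": 3.284}


def predict_livarea(sprice, beds, baths):
    """Evaluate the fitted linear model for one house."""
    return (COEF["intercept"] + COEF["sprice"] * sprice
            + COEF["beds"] * beds + COEF["baths"] * baths)


# Validation row 2 from the report: sprice = 105700, beds = 4, baths = 2.5, livarea = 21.
predicted = predict_livarea(105700, 4, 2.5)
print(f"predicted = {predicted:.2f}, actual = 21, residual = {21 - predicted:.2f}")
```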

d) Examine multicollinearity of the model:

  Variables Tolerance VIF
1 sprice    0.7443339 1.
2 beds      0.6784026 1.
3 baths     0.5902340 1.

The VIF values of the three variables sprice, beds and baths are all low (each below 2). Therefore we can ignore multicollinearity in this model.
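The VIF of a variable is the reciprocal of its tolerance, so the values can be derived from the Tolerance column. A short Python sketch using the tolerances printed in the table (variable names are illustrative):

```python
# Tolerance values from the report's multicollinearity table.
tolerance = {"sprice": 0.7443339, "beds": 0.6784026, "baths": 0.5902340}

# VIF_j = 1 / tolerance_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on the other predictors.
vif = {name: 1 / tol for name, tol in tolerance.items()}
for name, value in vif.items():
    print(f"{name}: VIF = {value:.3f}")
```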

e) Predict validation data

Root mean square error: 2.

With the predicted values, the root mean square error is 2.326186 (in hundreds of square feet). Some predicted values still differ noticeably from the actual values. However, the model can be considered acceptable for now.
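For reference, the root mean square error is the square root of the mean squared difference between actual and predicted values. The Python sketch below shows the computation on hypothetical numbers, which are for illustration only and do not reproduce the report's result; the function name rmse is illustrative.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical living areas (hundreds of square feet), for illustration only.
actual = [17, 21, 14, 19]
predicted = [15, 18, 14, 20]
print(f"RMSE = {rmse(actual, predicted):.3f}")
```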

f) Examine the independence of the residuals

lag Autocorrelation D-W Statistic p-value
  1     -0.01417278      2.026274 0.

Alternative hypothesis: rho != 0

Because p-value = 0.672 > 0.05, we fail to reject H0. Therefore, there is no evidence of autocorrelation among the residuals at significance level α = 5%.
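The Durbin-Watson statistic compares successive residuals: values near 2 indicate no lag-1 autocorrelation, and the rule of thumb DW ≈ 2(1 − r1) links it to the lag-1 autocorrelation r1. The Python sketch below illustrates both on hypothetical residuals (the report's residuals are not available here); the function name durbin_watson is illustrative.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals that alternate in sign (strong negative autocorrelation),
# which pushes the statistic well above 2.
toy_residuals = [1, -1, 1, -1]
print(durbin_watson(toy_residuals))

# Rule of thumb DW ~ 2 * (1 - r1), using the lag-1 autocorrelation from the report.
dw_approx = 2 * (1 - (-0.01417278))
print(round(dw_approx, 4))
```

The approximation lands close to the reported D-W statistic of 2.026274.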

g) Examine the stability of the model

studentized Breusch-Pagan test

data: fit

BP = 9.9009, df = 3, p-value = 0.

Because p-value = 0.01943 < 0.05, we reject H0. Therefore the variance of the residuals is not constant.

Shapiro-Wilk normality test

data: fit$residuals

W = 0.98954, p-value = 0.

In the plot above, a fan or cone shape indicates the presence of heteroskedasticity. This is a problem because linear regression assumes that the spread of the residuals is constant across the plot. If the residuals are unequally scattered, the population used in the regression has unequal variance, and the analysis results may therefore be invalid. For the Shapiro-Wilk test, p-value = 1.885e-06 < 0.05, so we reject H0. Therefore the residuals do not follow a normal distribution at significance level α = 5%.

VI. Solving heteroskedasticity

a) Transforming the outcome variable

We will transform the livarea by using a log transformation.

Call:

lm(formula = log(livarea) ~ sprice + beds + baths)

Residuals:

Min 1Q Median 3Q Max

-0.50160 -0.08441 0.00522 0.08497 0.

Coefficients:

                Estimate   Std. Error t value            Pr(>|t|)
(Intercept) 1.5969125000 0.0252103386   63.34 <0.0000000000000002 ***
sprice      0.0000035846 0.0000001549   23.13 <0.0000000000000002 ***
beds        0.0986148648 0.0086832411   11.36 <0.0000000000000002 ***
baths       0.2002734697 0.0121420354   16.49 <0.0000000000000002 ***

Min 1Q Median 3Q Max

-0.053929 -0.008931 0.000680 0.008554 0.

Coefficients:

                Estimate   Std. Error t value            Pr(>|t|)
(Intercept) 1.5899593067 0.0261919493   60.70 <0.0000000000000002 ***
sprice      0.0000037038 0.0000001625   22.80 <0.0000000000000002 ***
beds        0.0989562307 0.0089557488   11.05 <0.0000000000000002 ***
baths       0.1968430531 0.0121988256   16.14 <0.0000000000000002 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01408 on 976 degrees of freedom

Multiple R-squared: 0.7, Adjusted R-squared: 0.

F-statistic: 759.1 on 3 and 976 DF, p-value: < 0.

[Figure: Density histogram of the residuals]

With the studentized Breusch-Pagan test, p-value = 0.3555 > 0.05, so we fail to reject H0. Therefore we can treat the variance of the residuals as constant at significance level α = 5%. However, with the Shapiro-Wilk normality test, p-value = 0.02724 < 0.05, so we reject H0: the residuals do not follow a normal distribution at significance level α = 5%. At significance level α = 2%, however, normality is not rejected. The histogram also looks approximately normal.

Root mean square error: 2.

Compared to model 1, the RMSE is approximately equal.

c) Conclude

None of the three models has residuals that follow a normal distribution at significance level α = 5%. Among them, model 3 is the best: it shows no correlation among the residuals, the variance of its residuals is constant, and its residuals follow a normal distribution at a lower significance level. The problem may arise from the influence of outliers in the variables, even though they have been processed, or simply because the initial assumption that a linear regression model suits this dataset is not appropriate.

d) Choose other models with the AIC criterion

According to the AIC criterion, the best model is livarea^ = B0 + B1 sprice + B2 beds + B3 baths + B4 lgelot + B5 age. As shown by the plot above, age and lgelot do not appear to be correlated with livarea. Let us test whether they can be dropped.

H0: B4 = B5 = 0

Ha: ∃ Bi ≠ 0 (i = 4, 5)

Analysis of Variance Table

Model 1: livarea ~ sprice + beds + baths

Model 2: livarea ~ sprice + beds + baths + lgelot + age

  Res.Df    RSS Df Sum of Sq     F      Pr(>F)
1    976 4216.
2    974 4105.0  2    111.94 13.28 0.000002041 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because p-value = 0.000002041 < 0.05, we reject H0. Therefore, at least one of the coefficients of age and lgelot is nonzero, so these variables cannot be dropped in this case.
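The F statistic in the model comparison table can be recomputed from the residual sums of squares of the two nested models. The Python sketch below is not the report's R code and the function name partial_f is illustrative. It uses the printed values: RSS = 4105.0 on 974 degrees of freedom for the larger model, and a reduction of 111.94 from adding lgelot and age.

```python
def partial_f(rss_reduced, rss_full, df_reduced, df_full):
    """F statistic for comparing two nested linear models."""
    q = df_reduced - df_full  # number of coefficients being tested
    return ((rss_reduced - rss_full) / q) / (rss_full / df_full)


# From the ANOVA table: the full model has RSS = 4105.0 on 974 df, and adding
# lgelot and age reduced the RSS by 111.94, so the reduced RSS is 4105.0 + 111.94.
f_stat = partial_f(4105.0 + 111.94, 4105.0, 976, 974)
print(f"F = {f_stat:.2f}")
```

A large F value means the two extra coefficients reduce the residual sum of squares by more than chance alone would explain.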

e) Examine the AIC model

Call:

lm(formula = train$livarea ~ train$sprice + train$beds + train$baths +
    train$lgelot + train$age)

Residuals:

Min 1Q Median 3Q Max

-7.0995 -1.3897 -0.0748 1.2140 8.

Coefficients:

                 Estimate  Std. Error t value             Pr(>|t|)
(Intercept)  -3.995036902 0.463922547  -8.611 < 0.0000000000000002 ***
train$sprice  0.000058334 0.000002494  23.392 < 0.0000000000000002 ***
train$beds    1.681351070 0.133937491  12.553 < 0.0000000000000002 ***
train$baths   3.540180630 0.200511214  17.656 < 0.0000000000000002 ***
train$lgelot -0.929242160 0.369184659  -2.517                0.012 *
train$age     0.026609863 0.005505657   4.833           0.00000156 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.053 on 974 degrees of freedom

Multiple R-squared: 0.7471, Adjusted R-squared: 0.

F-statistic: 575.6 on 5 and 974 DF, p-value: < 0.

lag Autocorrelation D-W Statistic p-value
  1     -0.01297797      2.024146 0.

Alternative hypothesis: rho != 0