Learning R Software: Advantages, Challenges, and Coefficient of Determination, Lecture notes of Logic

These notes cover the advantages and disadvantages of using R for statistical analysis, focusing on its comprehensive library of statistical functions, steep learning curve, difficulty in writing new functions, and memory demands. They also explain the concept of R-squared (R^2) as a measure of the proportion of variance explained by a statistical model.

Typology: Lecture notes

2021/2022

Uploaded on 08/05/2022

nguyen_99 🇻🇳


it easy to program new statistical methods. The graphics of the language allow easy production of advanced, publication-quality graphics. Since a wide variety of experts use the program, R includes a comprehensive library of statistical functions, including many cutting-edge statistical methods. In addition to this, many third-party specialized methods are publicly available. And most important, R is free and open source.

A common concern of beginning users of R is the steep learning curve involved in using it. Such concern stems from the fact that R is a command-driven environment. Consequently, the statistical analysis is performed in a series of steps, in which commands are typed out and the results from each step are stored in objects that can be used by further inquiries. This is contrary to other programs, such as SPSS and SAS, which require users to determine all characteristics of the analysis up front and provide extensive output, thus relying on the users to identify what is relevant to their initial question.

Another source of complaints relates to the difficulty of writing new functions. The more complex the function, the more difficult it becomes to identify errors in syntax or logic. R will prompt the user with an error message, but no indication is given of the nature of the problem or its location within the new code. Consequently, despite the advantage afforded by being able to add new functions to R, many users may find it frustrating to write new routines. In addition, complex analyses and simulations in R tend to be very demanding on the computer memory and processor; thus, the more complex the analysis, the longer the time necessary to complete the task, sometimes days.

Large data sets or complex tasks place heavy demands on computer RAM, resulting in slow output.

Brandon K. Vaughn and Aline Orr

See also SAS; SPSS; Statistica; Systat

Web Sites

Comprehensive R Archive Network (CRAN): http://CRAN.R-project.org

The R Project for Statistical Computing: http://www.r-project.org

R^2

R-squared ($R^2$) is a statistic that measures the amount of variance accounted for in the relationship between two (or more) variables. $R^2$ is sometimes called the coefficient of determination, and in simple linear regression it is given as the square of the correlation coefficient.

Given paired variables $(X_i, Y_i)$, a linear model that explains the relationship between the variables is given by

$$Y = \beta_0 + \beta_1 X + e,$$

where $e$ is a mean zero error. The parameters of the linear model can be estimated using the least squares method, and the estimates are denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$, respectively. The parameters are estimated by minimizing the sum of squared residuals between the variable $Y_i$ and the model $\beta_0 + \beta_1 X_i$, that is,

$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2.$$

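It can be shown that the least squares estimates are

$$\hat{\beta}_0 = \bar{Y} - \bar{X}\,\frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}},$$

where the sample cross-covariance $S_{xy}$ is defined as

$$S_{xy} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \overline{XY} - \bar{X}\,\bar{Y},$$

and $S_{xx}$ is the same expression with $Y$ replaced by $X$.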
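The closed-form estimates above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `least_squares_line` and the toy data are hypothetical and do not come from the entry, and only the standard library is used.

```python
def least_squares_line(xs, ys):
    """Return (beta0_hat, beta1_hat) for the model Y = beta0 + beta1 * X + e."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample cross-covariance S_xy and its analogue S_xx (both divided by n).
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    beta1 = s_xy / s_xx
    beta0 = y_bar - x_bar * s_xy / s_xx
    return beta0, beta1


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
beta0, beta1 = least_squares_line(xs, ys)
print(beta0, beta1)
```

Because the toy points lie exactly on a line, the estimates should recover that line's intercept and slope, which makes the function easy to sanity-check.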

Statistical packages such as SAS, S-PLUS, and R provide a routine for obtaining the least squares estimates. The estimated model is denoted as

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X.$$

With the above notation, the sum of squared errors (SSE), or the sum of squared residuals, is given by

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.$$

SSE measures the amount of variability in $Y$ that is not explained by the model. Then how does one measure the amount of variability in $Y$ that is explained by the model? To answer this question, one needs to know the total variability present in the data. The total sum of squares (SST) is the measure of total variation in the $Y$ variable and is defined as

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$

where $\bar{Y}$ is the sample mean of the $Y$ variables, that is,

$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i.$$

Since SSE is the minimum of the sum of squared residuals over all linear models, SSE is never larger than SST. Then the amount of variability explained by the model is SST − SSE, which is denoted as the regression sum of squares (SSR), that is,

$$SSR = SST - SSE.$$
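The comparison between SSE and SST can be made explicit with a short derivation. It is offered as a sketch of the reasoning and relies only on the definitions already given: the constant model with $\beta_0 = \bar{Y}$ and $\beta_1 = 0$ is one of the linear models over which the least squares criterion is minimized.

```latex
\begin{align*}
SSE &= \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \\
    &\le \sum_{i=1}^{n} (Y_i - \bar{Y} - 0 \cdot X_i)^2
     = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = SST .
\end{align*}
```

It follows that the difference SST − SSE is nonnegative, which is what allows it to be read as an amount of explained variability.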

The ratio SSR/SST = (SST − SSE)/SST measures the proportion of variability explained by the model. The coefficient of determination ($R^2$) is defined as the ratio

$$R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST}.$$

The coefficient of determination is thus the ratio of the variation explained by the model to the total variation present in $Y$. Note that the coefficient of determination ranges between 0 and 1. The $R^2$ value is interpreted as the proportion of variation in $Y$ that is explained by the model. $R^2 = 1$ indicates that the model exactly explains the variability in $Y$, and hence the model must pass through every measurement $(X_i, Y_i)$. On the other hand, $R^2 = 0$ indicates that the model does not explain any variability in $Y$. As a rough rule of thumb, an $R^2$ value larger than .5 is often taken to indicate a substantial relationship, although what counts as large depends on the field of application.
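As a minimal sketch of the definition, the following Python snippet computes $R^2$ as (SST − SSE)/SST for a fitted line and compares it with the squared sample correlation coefficient. The function name `r_squared` and the five data points are hypothetical illustrations, not taken from the entry.

```python
import math


def r_squared(xs, ys):
    """Fit Y = b0 + b1 * X by least squares and return R^2 = (SST - SSE) / SST."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # SSE: variability left unexplained by the fitted line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # SST: total variability of Y about its mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return (sst - sse) / sst


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r2 = r_squared(xs, ys)

# Sample correlation coefficient, for comparison with R^2.
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys))
r = num / den

print(r2, r ** 2)
```

For a model with a single predictor, the two printed values should agree up to floating-point error, which matches the statement that $R^2$ is the square of a correlation coefficient.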

Case Study and Data

Consider the following paired measurements from Moore and McCabe (1989), based on occupational mortality records from 1970 to 1972 in England and Wales. The figures represent smoking rates and deaths from lung cancer for a number of occupational groups.

Smoking index (X)    Lung cancer mortality index (Y)
 77                   84
137                  116
117                  123
 94                  128
116                  155
102                  101
111                  118
 93                  113
 88                  104
102                   88
 91                  104
104                  129
107                   86
112                   96
113                  144
110                  139
125                  113
133                  146
115                  128
105                  115
 87                   79
 91                   85
100                  120
 76                   60
 66                   51

For a set of occupational groups, the first variable is the smoking index (average 100), and the second variable is the lung cancer mortality index (average 100). Suppose we are interested in determining how much the lung cancer mortality index ($Y$ variable) is influenced by the smoking index ($X$ variable). Figure 1 shows the scatterplot of the smoking index versus the lung cancer mortality index. The straight line is the estimated linear model, and it is given by

$$\hat{Y} = -2.8853 + 1.0875\,X.$$

SSE can be easily computed using the formula

$$SSE = \sum_{i=1}^{n} Y_i^2 - \hat{\beta}_0 \sum_{i=1}^{n} Y_i - \hat{\beta}_1 \sum_{i=1}^{n} X_i Y_i, \qquad (1)$$

and SST can be computed using the formula

$$SST = \sum_{i=1}^{n} Y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} Y_i \right)^2. \qquad (2)$$
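The computation described by formulas (1) and (2) can be sketched in Python as follows. The snippet assumes that the listed figures are read as consecutive (smoking index, mortality index) pairs; the variable names are illustrative, and only the standard library is used.

```python
# Case-study data, read as consecutive (smoking index X, mortality index Y) pairs.
pairs = [
    (77, 84), (137, 116), (117, 123), (94, 128), (116, 155),
    (102, 101), (111, 118), (93, 113), (88, 104), (102, 88),
    (91, 104), (104, 129), (107, 86), (112, 96), (113, 144),
    (110, 139), (125, 113), (133, 146), (115, 128), (105, 115),
    (87, 79), (91, 85), (100, 120), (76, 60), (66, 51),
]
n = len(pairs)

# Raw sums needed by the computational formulas.
sum_x = sum(x for x, _ in pairs)
sum_y = sum(y for _, y in pairs)
sum_xx = sum(x * x for x, _ in pairs)
sum_xy = sum(x * y for x, y in pairs)
sum_yy = sum(y * y for _, y in pairs)

# Least squares estimates via S_xy / S_xx.
x_bar, y_bar = sum_x / n, sum_y / n
s_xy = sum_xy / n - x_bar * y_bar
s_xx = sum_xx / n - x_bar ** 2
beta1 = s_xy / s_xx
beta0 = y_bar - x_bar * beta1

# Formula (1): SSE from raw sums; formula (2): SST from raw sums.
sse = sum_yy - beta0 * sum_y - beta1 * sum_xy
sst = sum_yy - sum_y ** 2 / n
r2 = (sst - sse) / sst

print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
print(f"SSE = {sse:.2f}, SST = {sst:.2f}, R^2 = {r2:.4f}")
```

Under the pairing assumption, the printed intercept and slope should reproduce, up to rounding, the fitted line quoted in the case study, which serves as a check that the data were read in the intended order.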

Further Readings

Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples of an indefinitely large population. Biometrika, 10, 507–521.

Moore, D. S., & McCabe, G. P. (1989). Introduction to the practice of statistics. New York: W. H. Freeman.

Nagelkerke, N. J. D. (1992). Maximum likelihood estimation of functional relationships. Lecture Notes in Statistics, 69, 110.

RADIAL PLOT

The radial plot is a graphical method for displaying and comparing observations that have differing precisions. Standardized observations are plotted against the precisions, where precision is defined as the reciprocal of the standard error. The original observations are given by slopes of lines through the origin. A scale of slopes is sometimes drawn explicitly.

Suppose, for example, that data are available on the degree classes obtained by students graduating from a university and that we wish to compare, for different major subjects, the proportions of students who achieved upper second-class honors or higher. Typically, different numbers of students graduate in different subjects. A radial plot will display the data as proportions so that they may be compared easily. Similarly, a radial plot can be used to compare other summary statistics (such as means, regression coefficients, odds ratios) observed for different sized groups, or event rates observed for differing time periods.

Sometimes, particularly in the natural and physical sciences, measurements intrinsically have differing precisions because of natural variation in the source material and experimental procedure. For example, archaeological and geochronological dating methods usually produce an age estimate and its standard error for each of several crystal grains or rock samples, and the standard errors differ substantially. In this case, the age estimates may be displayed and compared using a radial plot in order to examine whether they agree or how they differ.

A third type of application is in meta-analysis, such as in medicine, to compare estimated treatment effects from different studies. Here the precisions of the estimates can vary greatly because of the differing study sizes and designs. In this context the graph is often called a Galbraith plot.

In general, a radial plot is applicable when one wants to compare a number of estimates of some parameter of interest, for which the estimates have different standard errors. A basic question is, Do the estimates agree (within statistical variation) with a common value? If so, what value? A radial plot provides a visual assessment of the answer. Also, like many graphs, it allows other features of the data to be seen, such as whether the estimates differ systematically in some way, perhaps due to an underlying factor or mixture of populations, or whether there are anomalous values that need explanation. It is inherently not straightforward to compare individual estimates, either numerically or graphically, when their precisions vary. In particular, simply plotting estimates with error bars does not allow such questions to be assessed.

The term radial plot is also used for a display of directional data, such as wind directions and velocities or quantities observed at different times of day, via radial lines of different lengths emanating from a central point. This type of display is not discussed in this entry.

Mathematical Properties

Let $z_1, z_2, \ldots, z_n$ denote $n$ observations or estimates having standard errors $\sigma_1, \sigma_2, \ldots, \sigma_n$, which are either known or well estimated. Then we plot the points $(x_i, y_i)$ given by $x_i = 1/\sigma_i$ and $y_i = (z_i - z_0)/\sigma_i$, where $z_0$ is a convenient reference value. Each $y_i$ has unit standard deviation, so each point has the same standard error with respect to the $y$ scale, but estimates with higher precision plot farther from the origin on the $x$ scale. The (centered) observation $(z_i - z_0)$ is equal to $y_i/x_i$, which is the slope of the line joining $(0, 0)$ and $(x_i, y_i)$, so that values of $z$ can be shown on a scale of slopes. Figure 1 illustrates these principles. Furthermore, if each $z_i$ is an unbiased estimate of the same quantity $\mu$, say, then the points will scatter with unit standard deviation about a line from $(0, 0)$ with slope $\mu - z_0$. In particular, points scattering with unit standard deviation about the horizontal radius agree with the reference value $z_0$. This provides a simple visual assessment of how
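The coordinate transformation defined in this section can be sketched in Python as below. The function name `radial_coordinates`, the example estimates and standard errors, and the choice of a precision-weighted mean as the reference value $z_0$ are all assumptions made for illustration; the entry only requires $z_0$ to be a convenient reference value.

```python
def radial_coordinates(zs, sigmas, z0):
    """Return radial-plot points (x_i, y_i) with x_i = 1/sigma_i, y_i = (z_i - z0)/sigma_i."""
    return [(1.0 / s, (z - z0) / s) for z, s in zip(zs, sigmas)]


# Hypothetical estimates and their standard errors.
zs = [10.0, 12.0, 11.0]
sigmas = [0.5, 2.0, 1.0]

# Assumed reference value: mean of the estimates weighted by 1 / sigma^2.
weights = [1.0 / s ** 2 for s in sigmas]
z0 = sum(w * z for w, z in zip(weights, zs)) / sum(weights)

points = radial_coordinates(zs, sigmas, z0)
for (x, y), z in zip(points, zs):
    # The slope of the line from the origin through (x, y) is the centered estimate.
    print(f"x = {x:.3f}, y = {y:.3f}, slope = {y / x:.3f}, z - z0 = {z - z0:.3f}")
```

For every point the printed slope and centered estimate should coincide, which is the property that lets values of $z$ be read off a scale of slopes.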
