


it easy to program new statistical methods. The graphics of the language allow easy production of advanced, publication-quality graphics. Since a wide variety of experts use the program, R includes a comprehensive library of statistical functions, including many cutting-edge statistical methods. In addition, many third-party specialized methods are publicly available. And most important, R is free and open source.

A common concern of beginning users of R is the steep learning curve involved in using it. Such concern stems from the fact that R is a command-driven environment. Consequently, the statistical analysis is performed in a series of steps, in which commands are typed out and the results from each step are stored in objects that can be used by further inquiries. This is contrary to other programs, such as SPSS and SAS, which require users to determine all characteristics of the analysis up front and provide extensive output, thus relying on the users to identify what is relevant to their initial question.

Another source of complaints relates to the difficulty of writing new functions. The more complex the function, the more difficult it becomes to identify errors in syntax or logic. R will prompt the user with an error message, but no indication is given of the nature of the problem or its location within the new code. Consequently, despite the advantage afforded by being able to add new functions to R, many users may find it frustrating to write new routines. In addition, complex analyses and simulations in R tend to be very demanding on computer memory and processing; thus, the more complex the analysis, the longer the time necessary to complete the task, sometimes days. Large data sets or complex tasks place heavy demands on computer RAM, resulting in slow output.
Brandon K. Vaughn and Aline Orr
See also SAS; SPSS; Statistica; Systat
Web Sites
Comprehensive R Archive Network (CRAN): http://CRAN.R-project.org
The R Project for Statistical Computing: http://www.r-project.org
R-squared (R^2) is a statistic that describes the amount of variance accounted for in the relationship between two (or more) variables. Sometimes R^2 is called the coefficient of determination, and it is given as the square of a correlation coefficient. Given paired variables $(X_i, Y_i)$, a linear model that explains the relationship between the variables is given by
$$Y = \beta_0 + \beta_1 X + e,$$
where e is a mean-zero error. The parameters of the linear model can be estimated using the least squares method, with estimates denoted by $\hat\beta_0$ and $\hat\beta_1$, respectively. The parameters are estimated by minimizing the sum of squared residuals between the variable $Y_i$ and the model $\beta_0 + \beta_1 X_i$, that is,

$$(\hat\beta_0, \hat\beta_1) = \operatorname*{arg\,min}_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \bigl(Y_i - \beta_0 - \beta_1 X_i\bigr)^2.$$
It can be shown that the least squares estimates are

$$\hat\beta_0 = \bar Y - \bar X \, \frac{S_{xy}}{S_{xx}}$$

and

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}},$$

where the sample cross-covariance $S_{xy}$ is defined as

$$S_{xy} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y) = \overline{XY} - \bar X \, \bar Y.$$
Statistical packages such as SAS, S-PLUS, and R provide routines for obtaining the least squares estimates. The estimated model is denoted as
$$\hat Y = \hat\beta_0 + \hat\beta_1 X.$$
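The closed-form estimates above translate directly into code. The following is a minimal sketch in Python (the function name and toy data are invented for illustration; statistical packages provide equivalent routines):

```python
def least_squares(x, y):
    """Estimate (beta0, beta1) by the closed-form least squares formulas."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sample cross-covariance S_xy and variance S_xx, as defined in the text.
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x) / n
    beta1 = s_xy / s_xx            # slope = S_xy / S_xx
    beta0 = ybar - xbar * beta1    # intercept = Ybar - Xbar * slope
    return beta0, beta1
```

For example, the perfectly linear data x = (1, 2, 3), y = (2, 4, 6) recover an intercept of 0 and a slope of 2.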
With the above notation, the sum of squared errors (SSE), or the sum of squared residuals, is given by

$$\mathrm{SSE} = \sum_{i=1}^{n} \bigl(Y_i - \hat Y_i\bigr)^2.$$
SSE measures the amount of variability in Y that is not explained by the model. Then how does one measure the amount of variability in Y that is explained by the model? To answer this question,
one needs to know the total variability present in the data. The total sum of squares (SST) is the measure of total variation in the Y variable and is defined as

$$\mathrm{SST} = \sum_{i=1}^{n} \bigl(Y_i - \bar Y\bigr)^2,$$
where $\bar Y$ is the sample mean of the Y variable, that is,

$$\bar Y = \frac{1}{n} \sum_{i=1}^{n} Y_i.$$
Since SSE is the minimum of the sum of squared residuals over all linear models, SSE is never larger than SST. The amount of variability explained by the model is then SST − SSE, which is called the regression sum of squares (SSR), that is,

$$\mathrm{SSR} = \mathrm{SST} - \mathrm{SSE}.$$
The ratio SSR/SST = (SST − SSE)/SST measures the proportion of variability explained by the model. The coefficient of determination (R^2) is defined as this ratio:

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.$$
The coefficient of determination is given as the ratio of the variation explained by the model to the total variation present in Y. Note that the coefficient of determination ranges between 0 and 1. The R^2 value is interpreted as the proportion of variation in Y that is explained by the model. R^2 = 1 indicates that the model exactly explains the variability in Y, and hence the model must pass through every measurement $(X_i, Y_i)$. On the other hand, R^2 = 0 indicates that the model does not explain any variability in Y. An R^2 value larger than .5 is usually considered to indicate a substantial relationship.
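The decomposition R^2 = 1 − SSE/SST can be sketched in a few lines of code. This is an illustration with made-up numbers, not part of the original entry:

```python
def r_squared(x, y):
    """Compute R^2 = 1 - SSE/SST for a simple least squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # Slope and intercept from the closed-form least squares estimates.
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    s_xx = sum((a - xbar) ** 2 for a in x)
    beta1 = s_xy / s_xx
    beta0 = ybar - beta1 * xbar
    # SSE: variability not explained by the model; SST: total variability.
    sse = sum((b - (beta0 + beta1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - ybar) ** 2 for b in y)
    return 1 - sse / sst

# Perfectly linear data: the model explains all variability.
print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))  # prints 1.0
```

With perfectly collinear points the residuals are all zero, so SSE = 0 and R^2 = 1, matching the interpretation above.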
Case Study and Data
Consider the following paired measurements from Moore and McCabe (1989), based on occupational mortality records from 1970 to 1972 in England and Wales. The figures represent smoking rates and deaths from lung cancer for a number of occupational groups.
(77, 84), (137, 116), (117, 123), (94, 128), (116, 155), (102, 101), (111, 118), (93, 113), (88, 104), (102, 88), (91, 104), (104, 129), (107, 86), (112, 96), (113, 144), (110, 139), (125, 113), (133, 146), (115, 128), (105, 115), (87, 79), (91, 85), (100, 120), (76, 60), (66, 51)
For a set of occupational groups, the first variable is the smoking index (average 100), and the second variable is the lung cancer mortality index (average 100). Suppose we are interested in determining how much the lung cancer mortality index (Y variable) is influenced by the smoking index (X variable). Figure 1 shows the scatterplot of the smoking index versus the lung cancer mortality index. The straight line is the estimated linear model, and it is given by
$$\hat Y = -2.8853 + 1.0875\,X.$$
SSE can be easily computed using the formula
$$\mathrm{SSE} = \sum_{i=1}^{n} Y_i^2 - \hat\beta_0 \sum_{i=1}^{n} Y_i - \hat\beta_1 \sum_{i=1}^{n} X_i Y_i, \qquad (1)$$
and SST can be computed using the formula
$$\mathrm{SST} = \sum_{i=1}^{n} Y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} Y_i\right)^2. \qquad (2)$$
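Formulas (1) and (2) can be checked numerically on the case study data. The following Python sketch (variable names are invented for illustration) should recover the estimates reported above, roughly −2.8853 for the intercept and 1.0875 for the slope:

```python
# Smoking index (x) and lung cancer mortality index (y) for the
# 25 occupational groups in the case study.
x = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
     112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]
y = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
     96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least squares estimates via S_xy / S_xx.
s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
s_xx = sum((a - xbar) ** 2 for a in x)
beta1 = s_xy / s_xx
beta0 = ybar - beta1 * xbar

# SSE via formula (1) and SST via formula (2).
sse = (sum(b * b for b in y)
       - beta0 * sum(y)
       - beta1 * sum(a * b for a, b in zip(x, y)))
sst = sum(b * b for b in y) - sum(y) ** 2 / n
r2 = (sst - sse) / sst
```

The ratio `r2` is the coefficient of determination for the fitted line.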
Further Readings
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10, 507–521.

Moore, D. S., & McCabe, G. P. (1989). Introduction to the practice of statistics. New York: W. H. Freeman.

Nagelkerke, N. J. D. (1992). Maximum likelihood estimation of functional relationships. Lecture Notes in Statistics, 69, 110.
The radial plot is a graphical method for displaying and comparing observations that have differing precisions. Standardized observations are plotted against the precisions, where precision is defined as the reciprocal of the standard error. The original observations are given by slopes of lines through the origin. A scale of slopes is sometimes drawn explicitly.

Suppose, for example, that data are available on the degree classes obtained by students graduating from a university and that we wish to compare, for different major subjects, the proportions of students who achieved upper second-class honors or higher. Typically, different numbers of students graduate in different subjects. A radial plot will display the data as proportions so that they may be compared easily. Similarly, a radial plot can be used to compare other summary statistics (such as means, regression coefficients, or odds ratios) observed for groups of different sizes, or event rates observed for differing time periods.

Sometimes, particularly in the natural and physical sciences, measurements intrinsically have differing precisions because of natural variation in the source material and experimental procedure. For example, archaeological and geochronological dating methods usually produce an age estimate and its standard error for each of several crystal grains or rock samples, and the standard errors differ substantially. In this case, the age estimates may be displayed and compared using a radial plot in order to examine whether they agree or how they differ. A third type of application is in meta-analysis, such as in medicine, to compare estimated treatment effects from different studies.
Here the precisions of the estimates can vary greatly because of the differing study sizes and designs. In this context the graph is often called a Galbraith plot. In general, a radial plot is applicable when one wants to compare a number of estimates of some parameter of interest, for which the estimates have different standard errors.

A basic question is, Do the estimates agree (within statistical variation) with a common value? If so, what value? A radial plot provides a visual assessment of the answer. Also, like many graphs, it allows other features of the data to be seen, such as whether the estimates differ systematically in some way, perhaps due to an underlying factor or mixture of populations, or whether there are anomalous values that need explanation. It is inherently not straightforward to compare individual estimates, either numerically or graphically, when their precisions vary. In particular, simply plotting estimates with error bars does not allow such questions to be assessed.

The term radial plot is also used for a display of directional data, such as wind directions and velocities or quantities observed at different times of day, via radial lines of different lengths emanating from a central point. That type of display is not discussed in this entry.
Mathematical Properties
Let $z_1, z_2, \ldots, z_n$ denote n observations or estimates having standard errors $\sigma_1, \sigma_2, \ldots, \sigma_n$, which are either known or well estimated. Then we plot the points $(x_i, y_i)$ given by $x_i = 1/\sigma_i$ and $y_i = (z_i - z_0)/\sigma_i$, where $z_0$ is a convenient reference value. Each $y_i$ has unit standard deviation, so each point has the same standard error with respect to the y scale, but estimates with higher precision plot farther from the origin on the x scale. The (centered) observation $z_i - z_0$ is equal to $y_i/x_i$, which is the slope of the line joining (0, 0) and $(x_i, y_i)$, so that values of z can be shown on a scale of slopes. Figure 1 illustrates these principles. Furthermore, if each $z_i$ is an unbiased estimate of the same quantity $\mu$, say, then the points will scatter with unit standard deviation about a line from (0, 0) with slope $\mu - z_0$. In particular, points scattering with unit standard deviation about the horizontal radius agree with the reference value $z_0$. This provides a simple visual assessment of how
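The coordinate transformation just described is straightforward to compute. The following sketch uses hypothetical estimates and standard errors, not data from this entry:

```python
def radial_coords(z, se, z0):
    """Map estimates z with standard errors se to radial plot points.

    x_i = 1/se_i (the precision) and y_i = (z_i - z0)/se_i (the
    standardized, centered estimate), so the slope y_i/x_i of the line
    from the origin through point i recovers z_i - z0.
    """
    return [(1 / s, (zi - z0) / s) for zi, s in zip(z, se)]

# Three hypothetical age estimates with differing standard errors,
# centered at the reference value z0 = 100.
points = radial_coords([100.0, 105.0, 98.0], [2.0, 5.0, 1.0], z0=100.0)
```

Note that the most precise estimate (standard error 1.0) plots farthest from the origin on the x scale, while the first estimate, equal to the reference value, lies on the horizontal radius (y = 0).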