Design and Analysis of Clinical Study 5. Introduction to R and Statistics

Description:

Title: Prevalence of Breast and Bottle Feeding Author: SWSAHS Last modified by: Dr Tuan V Nguyen Created Date: 3/30/2000 2:12:19 AM Document presentation format –

Slides: 28
Provided by: sws049

1
Design and Analysis of Clinical Study 5.
Introduction to R and Statistics
  • Dr. Tuan V. Nguyen
  • Garvan Institute of Medical Research
  • Sydney, Australia

2
Outline
  • Introduction
  • Historical development
  • S, S-plus
  • Capability
  • Statistical Analysis
  • References
  • Calculator
  • Data Type
  • Resources
  • Simulation and Statistical Tables
  • Probability distributions

3
History
  • S: an interactive environment for data analysis
    developed at Bell Laboratories since 1976
  • 1988 - S2: RA Becker, JM Chambers, A Wilks
  • 1992 - S3: JM Chambers, TJ Hastie
  • 1998 - S4: JM Chambers
  • Exclusively licensed by AT&T/Lucent to Insightful
    Corporation, Seattle WA. Product name: "S-plus".
  • R: initially written by Ross Ihaka and Robert
    Gentleman at the Dept. of Statistics of the University
    of Auckland, New Zealand, during the 1990s.
  • Since 1997: an international "R-core" team of ca. 15
    people with access to a common CVS archive.

4
Introduction
  • R is "GNU S": a language and environment for
    data manipulation, calculation and graphical
    display.
  • a suite of operators for calculations on arrays
    and matrices
  • a large, coherent, integrated collection of data
    analysis tools
  • graphical facilities for data analysis
  • a well-developed programming language

5
What R does and does not do
  • is not a database, but connects to DBMSs
  • has no graphical user interface, but connects to
    Java, Tcl/Tk
  • language interpreter can be very slow, but allows
    calling your own C/C++ code
  • no spreadsheet view of data, but connects to
    Excel/MS Office
  • no professional / commercial support
  • data handling and storage: numeric, textual
  • matrix algebra
  • hash tables and regular expressions
  • high-level data analytic and statistical
    functions
  • graphics
  • programming language: loops, branching,
    subroutines

6
R and Statistics
  • Packaging: a crucial infrastructure to
    efficiently produce, load and keep consistent
    software libraries from (many) different sources
    / authors
  • Statistics: most packages deal with statistics
    and data analysis
  • State of the art: many statistical researchers
    provide their methods as R packages

7
Data Analysis and Presentation
  • The R distribution contains functionality for a
    large number of statistical procedures:
  • linear and generalized linear models
  • nonlinear regression models
  • time series analysis
  • classical parametric and nonparametric tests
  • clustering
  • smoothing
  • R also has a large set of functions which provide
    a flexible graphical environment for creating
    various kinds of data presentations.

8
R as a calculator
  • > log2(32)
  • [1] 5
  • > sqrt(2)
  • [1] 1.414214
  • > seq(0, 5, length=6)
  • [1] 0 1 2 3 4 5
  • > plot(sin(seq(0, 2*pi, length=100)))
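
The calculator examples above extend naturally to whole vectors, because R applies arithmetic element by element. The following is a minimal sketch; the variable names (`x`, `squares`, `total`, `bits`) are illustrative and not from the slides.

```r
# R evaluates arithmetic on whole vectors at once
x <- seq(0, 5, length.out = 6)   # six equally spaced values from 0 to 5
squares <- x^2                   # element-wise square
total <- sum(squares)            # 0 + 1 + 4 + 9 + 16 + 25
bits <- log2(32)                 # 2^5 = 32

print(squares)
print(total)
print(bits)
```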

9
Objects
  • Primitive (or atomic) data types in R are:
  • numeric (integer, double, complex)
  • character
  • logical
  • function
  • Out of these, vectors, arrays and lists can be
    built.
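
As a rough illustration of the types listed above, the sketch below builds a vector of each atomic type, an array, and a list that mixes them with a function. All object names are hypothetical.

```r
# One vector per atomic type
num <- c(1.5, 2, 3.25)         # double
int <- c(1L, 2L, 3L)           # integer (note the L suffix)
chr <- c("Nam", "Nu")          # character
lgl <- c(TRUE, FALSE, TRUE)    # logical

# An array is a vector with a dim attribute
arr <- array(1:24, dim = c(2, 3, 4))

# A list can hold objects of different types, including functions
lst <- list(values = num, labels = chr, summarise = mean)

types  <- c(typeof(num), typeof(int), typeof(chr), typeof(lgl))
result <- lst$summarise(lst$values)   # calls mean() on the stored vector

print(types)
print(result)
```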

10
R "grammar"
  • object <- function(argument1, argument2, ...,
    argumentn)
  • Example
  • > reg <- lm(y ~ x)
  • x == 5     x is equal to 5
  • x != 5     x is not equal to 5
  • y < x      y is less than x
  • x > y      x is greater than y
  • z <= 7     z is less than or equal to 7
  • p >= 1     p is greater than or equal to 1
  • is.na(x)   Is x a missing value?
  • A & B      A and B (AND)
  • A | B      A or B (OR)
  • !          not (NOT)
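
A short sketch of how these operators behave on a vector that contains a missing value. The vector `x` and the result names are made up for illustration; note that comparisons with `NA` return `NA` unless the missing value is handled explicitly.

```r
x <- c(3, 5, NA, 8)

eq5        <- x == 5              # FALSE  TRUE    NA FALSE
is_missing <- is.na(x)            # FALSE FALSE  TRUE FALSE
big        <- x > 4 & !is.na(x)   # FALSE  TRUE FALSE  TRUE
either     <- x < 4 | x >= 8      #  TRUE FALSE    NA  TRUE

# Logical vectors can be counted: TRUE is treated as 1
n_big <- sum(big)
print(n_big)
```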

11
Reading Data 1: Direct Method
  • age insulin
  • 50 16.5
  • 62 10.8
  • 60 32.3
  • 40 19.3
  • 48 14.2
  • 47 11.3
  • 57 15.5
  • 70 15.8
  • 48 16.2
  • 67 11.2

> age <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
> insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)
> ins <- data.frame(age, insulin)
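
Once the data frame exists, a few quick checks confirm it was entered correctly. This is a minimal sketch using the age and insulin values from the slide; the names `dims`, `col_means` and `older` are illustrative.

```r
age <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)
ins <- data.frame(age, insulin)

dims <- dim(ins)               # rows, columns
col_means <- colMeans(ins)     # mean of each column
older <- ins[ins$age > 60, ]   # subset: subjects older than 60

print(dims)
print(col_means)
print(older)
```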
12
Reading Data 2: read.table
  • id  sex  age  bmi  hdl    ldl  tc   tg
  • 1   Nam  57   17   5.000  2.0  4.0  1.1
  • 2   Nu   64   18   4.380  3.0  3.5  2.1
  • 3   Nu   60   18   3.360  3.0  4.7  0.8
  • 4   Nam  65   18   5.920  4.0  7.7  1.1
  • 5   Nam  47   18   6.250  2.1  .    2.1
  • 6   Nu   65   18   4.150  3.0  4.2  1.5
  • 7   Nam  76   19   0.737  3.0  5.9  2.6

> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE, na.strings=".")
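
To try the same import without a file on disk, `read.table` also accepts the data as a string through its `text` argument. The sketch below uses the seven rows shown on the slide and assumes that "." marks a missing value, which is passed as `na.strings = "."`.

```r
raw <- "id sex age bmi hdl ldl tc tg
1 Nam 57 17 5.000 2.0 4.0 1.1
2 Nu 64 18 4.380 3.0 3.5 2.1
3 Nu 60 18 3.360 3.0 4.7 0.8
4 Nam 65 18 5.920 4.0 7.7 1.1
5 Nam 47 18 6.250 2.1 . 2.1
6 Nu 65 18 4.150 3.0 4.2 1.5
7 Nam 76 19 0.737 3.0 5.9 2.6"

# "." is converted to NA, so tc stays a numeric column
chol <- read.table(text = raw, header = TRUE, na.strings = ".")

n_missing <- sum(is.na(chol$tc))
tc_mean <- mean(chol$tc, na.rm = TRUE)   # skip the missing value

print(chol)
print(n_missing)
print(tc_mean)
```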
13
Reading Data 3: read.csv
  • Step 1: Use the "Save as" command in Excel and save
    the data in "csv" format
  • Step 2: Use R (the read.csv command) to import the
    csv data
  • > setwd("c:/works/stats")
  • > gh <- read.csv("excel.csv", header=TRUE)
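
The two steps can be rehearsed entirely in R by writing a small CSV to a temporary file and reading it back. This is only a sketch: the data frame `gh_out` is made up, and the temporary file stands in for the file exported from Excel.

```r
# Stand-in for the spreadsheet saved as csv (hypothetical values)
gh_out <- data.frame(id = 1:3,
                     sex = c("Nam", "Nu", "Nu"),
                     age = c(57, 64, 60))

path <- tempfile(fileext = ".csv")
write.csv(gh_out, path, row.names = FALSE)   # step 1: save as csv

gh <- read.csv(path, header = TRUE)          # step 2: import into R
unlink(path)                                 # remove the temporary file

print(gh)
```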

14
A Simple Session
  • sex <- c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nam", "Nam", "Nu",
    "Nu", "Nam", "Nu", "Nam", "Nam", "Nu", "Nu", "Nu", "Nu", "Nu",
    "Nu", "Nu", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nu", "Nu",
    "Nu", "Nam", "Nam", "Nu", "Nu", "Nam", "Nu", "Nam", "Nu", "Nu",
    "Nam", "Nu", "Nam", "Nam", "Nam", "Nu", "Nam", "Nam", "Nu", "Nu")
  • age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
    63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
    61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
    60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
    51, 58, 60, 45, 63, 52, 64, 45, 64, 62)
  • bmi <- c(17, 18, 18, 18, 18, 18, 19, 19, 19, 19,
    20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21,
    21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
    23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24,
    24, 24, 25, 25)
  • hdl <- c(5.000, 4.380, 3.360, 5.920, 6.250,
    4.150, 0.737, 7.170, 6.942, 5.000, 4.217, 4.823,
    3.750, 1.904, 6.900, 0.633, 5.530, 6.625, 5.960,
    3.800, 5.375, 3.360, 5.000, 2.608, 4.130, 5.000,
    6.235, 3.600, 5.625, 5.360, 6.580, 7.545,
    6.440, 6.170, 5.270, 3.220, 5.400, 6.300, 9.110,
    7.750, 6.200, 7.050, 6.300, 5.450, 5.000, 3.360,
    7.170, 7.880, 7.360, 7.750)
  • ldl <- c(2.0, 3.0, 3.0, 4.0, 2.1, 3.0, 3.0,
    3.0, 3.0, 2.0, 5.0, 1.3, 1.2, 0.7, 4.0,
    4.1, 4.3, 4.0, 4.3, 4.0, 3.1, 3.0, 1.7,
    2.0, 2.1, 4.0, 4.1, 4.0, 4.2, 4.2, 4.4,
    4.3, 2.3, 6.0, 3.0, 3.0, 2.6, 4.4, 4.3,
    4.0, 3.0, 4.1, 4.4, 2.8, 3.0, 2.0, 1.0,
    4.0, 4.6, 4.0)
  • tc <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1,
    5.9, 4.0, 6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3,
    7.1, 3.8, 4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3, 5.4,
    4.5, 5.9, 5.6, 8.3, 5.8, 7.6, 5.8, 3.1, 5.4, 6.3,
    8.2, 6.2, 6.2, 6.7, 6.3, 6.0, 4.0, 3.7, 6.1, 6.7,
    8.1, 6.2)
  • tg <- c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5,
    5.4, 1.9, 1.7, 1.0, 1.6, 1.1, 1.5, 1.0, 2.7, 3.9,
    3.0, 3.1, 2.2, 2.7, 1.1, 0.7, 1.0, 1.7, 2.9, 2.5,
    6.2, 1.3, 3.3, 3.0, 1.0, 1.4, 2.5, 0.7, 2.4, 2.4,
    1.4, 2.7, 2.4, 3.3, 2.0, 2.6, 1.8, 1.2, 1.9, 3.3,
    4.0, 2.5)
  • cong <- data.frame(sex, age, bmi, hdl, ldl, tc,
    tg)
  • attach(cong)

15
Bar Graph
  • > sex.freq <- table(sex)
  • > sex.freq
  • sex
  • Nam  Nu
  •  22  28
  • > barplot(sex.freq, main="Frequency of males and
    females")
  • > barplot(table(sex), main="Frequency of males
    and females")
  • > stripchart(tg, main="Strip chart for
    triglycerides", xlab="mg/L")
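
The frequency table can also be turned into percentages before plotting. The sketch below rebuilds `sex` from the counts printed on the slide (22 Nam, 28 Nu) rather than from the full data set, which is an assumption made to keep the example short; `sex.pct` is an illustrative name.

```r
# Rebuild the variable from the counts shown on the slide
sex <- rep(c("Nam", "Nu"), times = c(22, 28))

sex.freq <- table(sex)                             # counts per category
sex.pct  <- round(100 * prop.table(sex.freq), 1)   # percentages

print(sex.freq)
print(sex.pct)
# barplot(sex.pct, main = "Percentage of males and females")
```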

16
Histogram, Boxplot
  • > hist(age)
  • > hist(age, main="Frequency distribution by age
    group", xlab="Age group", ylab="No of patients")
  • > plot(density(age))
  • > boxplot(tc, main="Box plot of total
    cholesterol", ylab="mg/L")
  • > boxplot(tc ~ sex, horizontal=TRUE, main="Box plot
    of total cholesterol", ylab="mg/L", col="pink")
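
One common way to show a histogram and a density estimate on the same axes is to draw the histogram on the density scale and add the curve with `lines()`. This is a sketch on simulated ages, not the study data; `pdf(NULL)` only keeps the example from opening a plot window and can be dropped in an interactive session.

```r
set.seed(1)
age <- round(rnorm(50, mean = 58, sd = 9))   # simulated ages (assumption)

pdf(NULL)                                    # discard graphics output
h <- hist(age, freq = FALSE,                 # density scale, not counts
          main = "Age distribution with density estimate",
          xlab = "Age")
d <- density(age)                            # kernel density estimate
lines(d, col = "red")                        # overlay on the histogram
invisible(dev.off())

# On the density scale the bar areas sum to 1
bar_area <- sum(h$density * diff(h$breaks))
print(bar_area)
```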

17
Multiple Graphs
  • > op <- par(mfrow=c(2,3))
  • gt hist(tc)
  • gt hist(hdl)
  • gt hist(ldl)
  • gt hist(tg)
  • gt hist(bmi)
  • gt hist(age)
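
Because `par()` returns the previous settings, the value saved in `op` can be used to restore the single-plot layout afterwards. The sketch below uses six simulated variables in place of the lipid data, and `pdf(NULL)` is there only to avoid opening a plot window.

```r
set.seed(2)
vars <- as.data.frame(matrix(rnorm(300), ncol = 6))   # simulated data

pdf(NULL)                       # discard graphics output
op <- par(mfrow = c(2, 3))      # 2 rows x 3 columns; old settings saved in op
for (v in names(vars)) {
  hist(vars[[v]], main = v, xlab = v)
}
layout_during <- par("mfrow")
par(op)                         # restore the previous layout
layout_after <- par("mfrow")
invisible(dev.off())

print(layout_during)
print(layout_after)
```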

18
Scatter Plots
  • > plot(tc, hdl)
  • > plot(hdl, tc, pch=ifelse(sex=="Nam", 16, 22))
  • > plot(hdl, tc, pch=ifelse(sex=="Nam", "M", "F"))
  • > plot(hdl ~ tc, pch=16, main="Total cholesterol
    and HDL cholesterol with LOWESS smooth function",
    xlab="Total cholesterol", ylab="HDL cholesterol",
    bty="l")
  • > lines(lowess(tc, hdl, f=2/3, iter=3),
    col="red")
  • > lipid <- data.frame(age, bmi, hdl, ldl, tc)
  • > pairs(lipid, pch=16)

19
Descriptive Statistics
  • > mean(tc)
  • [1] 5.414
  • > var(tc)
  • [1] 1.962045
  • > sd(tc)
  • [1] 1.40073
  • > summary(cong)
  •   sex          age             bmi             hdl             ldl
  •  Nam:22   Min.   :42.00   Min.   :17.00   Min.   :0.633   Min.   :0.700
  •  Nu :28   1st Qu.:49.25   1st Qu.:20.00   1st Qu.:4.167   1st Qu.:2.650
  •           Median :59.50   Median :22.00   Median :5.425   Median :3.050
  •           Mean   :57.64   Mean   :21.38   Mean   :5.333   Mean   :3.292
  •           3rd Qu.:63.75   3rd Qu.:23.00   3rd Qu.:6.545   3rd Qu.:4.100
  •           Max.   :80.00   Max.   :25.00   Max.   :9.110   Max.   :6.000
  •        tc              tg
  •  Min.   :3.000   Min.   :0.700
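
For reference, `mean`, `var` and `sd` can be reproduced from their definitions; `var` uses the n - 1 denominator and `sd` is its square root. The sketch below uses only the first five total cholesterol values from the session, so its results differ from the full-sample figures above.

```r
x <- c(4.0, 3.5, 4.7, 7.7, 5.0)   # first five tc values

n <- length(x)
m <- sum(x) / n                    # sample mean
v <- sum((x - m)^2) / (n - 1)      # sample variance (n - 1 denominator)
s <- sqrt(v)                       # standard deviation

print(c(mean = m, var = v, sd = s))
```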

20
Descriptive Statistics by Group, t-test
  • > tapply(tc, list(sex), mean)
  •      Nam       Nu
  • 5.554545 5.303571
  • > t.test(tc ~ sex, data=cong)
  • Welch Two Sample t-test
  • data: tc by sex
  • t = 0.6283, df = 46.09, p-value = 0.5329
  • alternative hypothesis: true difference in means
    is not equal to 0
  • 95 percent confidence interval:
  • -0.553024 1.054972
  • sample estimates:
  • mean in group Nam mean in group Nu
  • 5.554545 5.303571
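
The Welch statistic and its degrees of freedom reported by `t.test` can be recomputed from the group means and variances. This sketch uses two simulated groups of the same sizes as in the study (22 and 28); the data and the names `g1`, `g2`, `t_manual` and `df_manual` are assumptions for illustration.

```r
set.seed(3)
g1 <- rnorm(22, mean = 5.5, sd = 1.4)   # simulated group 1
g2 <- rnorm(28, mean = 5.3, sd = 1.4)   # simulated group 2

tt <- t.test(g1, g2)                    # Welch test is the default

n1 <- length(g1); n2 <- length(g2)
v1 <- var(g1) / n1                      # squared standard error, group 1
v2 <- var(g2) / n2                      # squared standard error, group 2

t_manual  <- (mean(g1) - mean(g2)) / sqrt(v1 + v2)
# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

print(c(t = t_manual, df = df_manual, p = tt$p.value))
```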

21
Wilcoxon test
  • > wilcox.test(tc ~ sex, data=cong)
  • Wilcoxon rank sum test with continuity
    correction
  • data: tc by sex
  • W = 355, p-value = 0.3629
  • alternative hypothesis: true mu is not equal to 0
  • Warning message:
  • cannot compute exact p-value with ties in:
    wilcox.test.default(x = c(4, 7.7, 5, 5.9, 6.1,
    5.9, 4.1, 4, 6.9, ...
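
The warning appears because an exact p-value is not computed when the data contain ties, so R falls back to the normal approximation. Requesting the approximation explicitly with `exact = FALSE` should give the same p-value without the warning. The two small vectors below are made up for the sketch.

```r
# Two small samples that share tied values (hypothetical data)
x <- c(4.0, 7.7, 5.0, 5.9, 6.1, 5.9)
y <- c(3.5, 4.7, 4.2, 4.0, 5.9, 5.0)

# Default call: falls back to the normal approximation and warns about ties
res_default <- suppressWarnings(wilcox.test(x, y))

# Asking for the normal approximation directly
res_normal <- wilcox.test(x, y, exact = FALSE)

print(res_default$p.value)
print(res_normal$p.value)
```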

22
Test for Two Proportions
  • > fracture <- c(7, 20)
  • > total <- c(100, 110)
  • > prop.test(fracture, total)
  • 2-sample test for equality of proportions
    with continuity correction
  • data: fracture out of total
  • X-squared = 4.8901, df = 1, p-value = 0.02701
  • alternative hypothesis: two.sided
  • 95 percent confidence interval:
  • -0.20908963 -0.01454673
  • sample estimates:
  • prop 1 prop 2
  • 0.0700000 0.1818182
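
For two groups, `prop.test` is equivalent to a chi-squared test with continuity correction on the corresponding 2 x 2 table. The sketch below builds that table from the slide's counts and compares the two results; the column name `no_fracture` and the object names are illustrative.

```r
fracture <- c(7, 20)
total <- c(100, 110)

res <- prop.test(fracture, total)

# The same data as a 2 x 2 table: events and non-events per group
tab <- cbind(fracture, no_fracture = total - fracture)
chi <- chisq.test(tab)                  # continuity correction by default

# Difference between the two estimated proportions
diff_est <- unname(res$estimate[1] - res$estimate[2])

print(res$statistic)
print(chi$statistic)
print(diff_est)
```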

23
Comparison of Multiple Proportions
  • > female <- c(4, 43, 22, 0)
  • > total <- c(8, 60, 30, 2)
  • > prop.test(female, total)
  • 4-sample test for equality of proportions
    without continuity correction
  • data: female out of total
  • X-squared = 6.2646, df = 3, p-value = 0.09942
  • alternative hypothesis: two.sided
  • sample estimates:
  • prop 1 prop 2 prop 3 prop 4
  • 0.5000000 0.7166667 0.7333333 0.0000000
  • Warning message:
  • Chi-squared approximation may be incorrect in:
    prop.test(female, total)
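
The warning is raised when some expected cell counts are small. A quick way to see this is to inspect the expected counts, and one possible alternative (not used on the slide) is Fisher's exact test on the same table. The sketch uses the slide's counts; the column name `male` and the object names are illustrative.

```r
female <- c(4, 43, 22, 0)
total <- c(8, 60, 30, 2)

# Contingency table: females and males in each of the four groups
tab <- cbind(female, male = total - female)

# Expected counts under equal proportions (warning suppressed on purpose)
expected <- suppressWarnings(chisq.test(tab))$expected
small <- sum(expected < 5)      # number of cells with expected count below 5

# Exact test that does not rely on the chi-squared approximation
exact <- fisher.test(tab)

print(round(expected, 2))
print(small)
print(exact$p.value)
```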

24
Linear Regression Analysis
  • > age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22,
    63, 40, 48, 28, 49)
  • > bmi <- c(25.4, 20.6, 26.2, 22.6, 25.4, 23.1, 22.7, 24.9,
    19.8, 25.3, 23.2, 21.8, 20.9, 26.7, 26.4, 21.2, 21.2, 22.8)
  • > chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8,
    4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
  • > data <- data.frame(age, bmi, chol)
  • > plot(chol ~ age, pch=16)

25
Coefficient of Correlation
  • > cor.test(age, chol)
  • Pearson's product-moment correlation
  • data: age and chol
  • t = 10.7035, df = 16, p-value = 1.058e-08
  • alternative hypothesis: true correlation is not
    equal to 0
  • 95 percent confidence interval:
  • 0.8350463 0.9765306
  • sample estimates:
  • cor
  • 0.936726
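
The t statistic reported by `cor.test` follows from the correlation coefficient r and the sample size n as t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The sketch below checks this on the age and cholesterol data from the slides; `t_manual` is an illustrative name.

```r
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22,
         63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8,
          4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

r <- cor(age, chol)                 # Pearson correlation
n <- length(age)

# t statistic with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

ct <- cor.test(age, chol)
print(c(r = r, t_manual = t_manual, t_cor.test = unname(ct$statistic)))
```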

26
Simple Linear Regression Analysis
  • > reg <- lm(chol ~ age)
  • > summary(reg)
  • Call:
  • lm(formula = chol ~ age)
  • Residuals:
  •      Min       1Q   Median       3Q      Max
  • -0.40729 -0.24133 -0.04522  0.17939  0.63040
  • Coefficients:
  •             Estimate Std. Error t value Pr(>|t|)
  • (Intercept) 1.089218   0.221466   4.918 0.000154 ***
  • age         0.057788   0.005399  10.704 1.06e-08 ***
  • ---
  • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
    '.' 0.1 ' ' 1
  • Residual standard error: 0.3027 on 16 degrees of
    freedom
  • Multiple R-Squared: 0.8775, Adjusted
    R-squared: 0.8698
  • F-statistic: 114.6 on 1 and 16 DF, p-value:
    1.058e-08
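
The fitted coefficients can be used to predict cholesterol at a given age, either by hand or with `predict`. The sketch below refits the model on the slide's data and predicts at age 50, a value chosen only for illustration; it also checks that, for a single predictor, R-squared equals the squared correlation.

```r
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22,
         63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8,
          4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg <- lm(chol ~ age)
b <- coef(reg)                        # intercept and slope

# Prediction by hand: intercept + slope * age
pred_manual <- unname(b["(Intercept)"] + b["age"] * 50)

# The same prediction through predict()
pred <- unname(predict(reg, newdata = data.frame(age = 50)))

# With one predictor, R-squared is the squared correlation
r2 <- summary(reg)$r.squared

print(c(pred_manual = pred_manual, pred = pred, r2 = r2))
```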

27
Multiple Linear Regression Analysis
  • > mreg <- lm(chol ~ age + bmi)
  • > summary(mreg)
  • Call:
  • lm(formula = chol ~ age + bmi)
  • Residuals:
  •     Min      1Q  Median      3Q     Max
  • -0.3762 -0.2259 -0.0534  0.1698  0.5679
  • Coefficients:
  •             Estimate Std. Error t value Pr(>|t|)
  • (Intercept) 0.455458   0.918230   0.496    0.627
  • age         0.054052   0.007591   7.120 3.50e-06 ***
  • bmi         0.033364   0.046866   0.712    0.487
  • ---
  • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
    '.' 0.1 ' ' 1
  • Residual standard error: 0.3074 on 15 degrees of
    freedom
  • Multiple R-Squared: 0.8815, Adjusted
    R-squared: 0.8657
  • F-statistic: 55.77 on 2 and 15 DF, p-value:
    1.132e-07
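
Whether adding bmi improves on the age-only model can also be checked by comparing the two nested fits with `anova`. When a single term is added, the F test from that comparison gives the same p-value as the t test for that term in the larger model. The sketch below refits both models on the slide's data; `cmp`, `p_f` and `p_t` are illustrative names.

```r
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22,
         63, 40, 48, 28, 49)
bmi <- c(25.4, 20.6, 26.2, 22.6, 25.4, 23.1, 22.7, 24.9,
         19.8, 25.3, 23.2, 21.8, 20.9, 26.7, 26.4, 21.2, 21.2, 22.8)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8,
          4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

reg  <- lm(chol ~ age)          # smaller model
mreg <- lm(chol ~ age + bmi)    # larger model with bmi added

cmp <- anova(reg, mreg)         # F test for the added term
p_f <- cmp[2, "Pr(>F)"]
p_t <- summary(mreg)$coefficients["bmi", "Pr(>|t|)"]

print(cmp)
print(c(p_f = p_f, p_t = p_t))
```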