Title: Basic principles of probability theory
1 Name: Garib Murshudov (when asking questions, "Garib" is sufficient)
- E-mail: garib_at_ysbl.york.ac.uk
- Location: Bioscience Building (New Biology), K066
- Webpage for lecture notes and exercises: www.ysbl.york.ac.uk/garib/mres_course/2006/
- There will be two types of exercises:
- With numbers. They will be marked.
- With names. You can do them and I will mark them.
- You can send questions about this course, and other questions I can help with, to the above e-mail address.
2 Additional materials
- Linear and matrix algebra
- Eigenvalue/eigenvector decomposition
- Singular value decomposition
- Operations on matrices and vectors
- Basics of probabilities and statistics
- Probability concept
- Characteristic/moment generating/cumulant generating functions
- Entropy and maximum entropy
- Some standard distributions (e.g. normal, t, F, chi-squared distributions)
- Point and interval estimation
- Elements of hypothesis testing
- Sampling and sampling distributions
- Optimisation techniques
- Gradient methods
- Super-linear and second order techniques
3 Introduction to R
- Examples of analysis in this course will be done using R. You can use any package you are familiar with. However, I may not be able to help in these cases.
- R is a multipurpose statistical package. It is freely available from
- http://www.r-project.org/
- Or just type R into a Google search. The first hit is usually a link to R.
- It should be straightforward to download.
- R is an environment (in unix terminology it is some sort of shell) that offers everything from simple calculations to sophisticated statistical functions.
- You can run programs available in R or write your own scripts using these programs. You can also write a program in your favourite language (C, C++, FORTRAN) and put it in R.
- If you have the mind of a programmer then it is perfect for you. If you have the mind of a user, it gives you very good options to do what you want to do.
- Here I give a very brief introduction to some of the commands of R. During the course I will give some other useful commands for each technique.
4 To get started
- If you are using Windows: once you have downloaded R (the University already has it), you can either follow the path Start/Programs/R or, if you have a shortcut to R, double click that icon. You will then have an R window.
- If you are using unix/linux/MacOS: after defining the path where the R executables are, just type R in one of your windows. Usually the path is defined during installation.
- Useful commands for beginners
- help.start()
- will usually start a web browser and you can start learning. A very useful section is "An Introduction to R". There is also a search engine.
- To get information about a command you just type
- ?command
- It will give some sort of help (sometimes helpful help).
- command (the name of the command without parentheses)
- Gives the R script if available. Reading these scripts may help you to write your own script or program.
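As a small illustration of the help commands above, here is a minimal sketch. The functions mean and sd are only example targets chosen for this sketch, and apropos is an extra command not mentioned on the slide.

```r
# Help page for a function; ?mean is shorthand for the same call
help("mean")

# Typing a function name without parentheses prints its R source
sd

# List objects whose names contain "mean" (useful when you only
# remember part of a command's name)
found = apropos("mean")
found
```

Reading the printed source of a short function such as sd is a gentle way to see how R scripts are written.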
5 Simple commands: assignment
- The simplest command is that of assignment
- v = 5.0
- or
- v <- 5.0
- the value of the variable v will become 5.0 (although there are several ways of doing assignment, I will always use =)
- If you type
- v = c(1.0,2.0,10.0,1.5,2.5,6.5)
- it will make a vector of length 6.
- If you type
- v
- R will print the value(s) of the variable v.
- v = c("mine","yours","his/hers","theirs","its")
- will create a vector of characters. The type of a variable is defined on the fly.
- To access a particular value of a vector use, for example,
- v[1] the first element
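The assignment and indexing commands above can be tried together. This is a short sketch; the variable name w is introduced here only to keep the numeric and character vectors apart.

```r
v = 5.0                                   # a single number
v = c(1.0, 2.0, 10.0, 1.5, 2.5, 6.5)      # v is now a numeric vector

length(v)      # number of elements
v[1]           # first element (indices start at 1)
v[2:3]         # second and third elements

w = c("mine", "yours", "his/hers", "theirs", "its")
class(v)       # "numeric"
class(w)       # "character"
```

The same name can hold a number, then a vector, then characters, because R decides the type when the value is assigned.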
6 To create a matrix
- The simplest way to create a matrix is to create a vector and then convert it to a matrix
- c = vector(length=100)
- c = 1:100 (the values of c will become integers from 1 to 100)
- dim(c) = c(5,20)
- c
- The dim command will work whenever you have a vector whose length equals the product of the dimensions. The resulting c will be a matrix with dimensions 5x20.
- You can also use
- d = matrix(c,nrow=5,ncol=20) or d = matrix(c,nrow=5) or d = matrix(c,ncol=20)
- d
- then c will be kept intact and d will become a matrix. You can also give names to the columns and rows (LETTERS is a built in vector of the English letters)
- rownames(d) = LETTERS[1:5]
- colnames(d) = LETTERS[1:20]
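Both routes to a matrix can be compared in a few lines. This sketch uses the names x and m instead of c, purely to avoid confusion with the function c(); that renaming is a choice made here, not part of the slide.

```r
x = 1:100                 # integers from 1 to 100

m = x
dim(m) = c(5, 20)         # reshape in place: m is now a 5x20 matrix

d = matrix(x, nrow = 5)   # x is left intact; ncol is worked out as 20

rownames(d) = LETTERS[1:5]
colnames(d) = LETTERS[1:20]

dim(d)
d["B", "C"]               # row 2, column 3; R fills matrices column by column
all(m == d)               # both routes give the same values
```

Because the matrix is filled column by column, element [2,3] is the 12th value of the original vector.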
7 Simple calculations: arithmetic
- Almost all elementary functions are available
- exp(v)
- log(v)
- tan(v)
- cos(v) and others
- These functions are applied to all elements of the vector (or matrix). The result is a vector (or matrix) of the same size as the argument. It will of course fail if v is a vector of characters and you are trying to use a function with a real argument, or if the values are outside the range of the function's argument space.
- Apart from elementary functions there are many built in special functions such as Bessel functions (besselI(x,n), besselK(x,n) etc.), gamma functions and many others. Just have a look at help.start() and use "Search Engine & Keywords".
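A short sketch of element-wise evaluation follows. The input values are arbitrary, and the call to suppressWarnings is added here only so that the out-of-range case can be shown quietly.

```r
v = c(1.0, 2.0, 10.0, 1.5, 2.5, 6.5)

exp(v)                    # one result per element of v
log(v)

m = matrix(v, nrow = 2)   # the same values as a 2x3 matrix
cos(m)                    # result is again a 2x3 matrix

besselI(1.0, 0)           # modified Bessel function of the first kind, order 0
gamma(5)                  # gamma function; equals 4! = 24

# Outside the domain of log: R returns NaN (normally with a warning)
bad = suppressWarnings(log(-1))
bad
```

Note that a numeric argument outside a function's domain gives NaN with a warning, whereas a character argument stops with an error.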
8 Two more commands for sorting
- There are two commands for sorting. One of them is
- sort(randu[,1])
- It just sorts the data in ascending order. It has limited use. Another, more important one does not sort but creates a vector of indices that corresponds to the sorted data. That is
- order(randu[,1])
- It gives the positions of the ordered data. It can now be used to access the data in ordered form. sort(data) and data[order(data)] are equivalent.
- randu[order(randu[,1]),]
- will rearrange the rows of the data so that the first column is sorted.
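The relationship between sort and order can be checked directly. This is a minimal sketch on the built in randu data; the names x, idx and sorted are introduced here for illustration.

```r
data(randu)
x = randu[, 1]

idx = order(x)                 # positions that put x in ascending order
identical(sort(x), x[idx])     # sort(x) and x[order(x)] agree

# Reorder whole rows so that the first column is ascending;
# the other columns travel with their rows
sorted = randu[idx, ]
head(sorted)
```

This is why order is the more useful of the two: the same index vector can rearrange any other vector or table that is aligned with x.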
9 Reading from files
- The simplest way of reading a table from a file is to use
- d = read.table("name of the file")
- It will read the table from the file (you may have some problems if you are using Windows). Do not forget to put an end of line after the final line if you are using Windows.
- scan is also a useful command for reading.
- d = scan(file="name of the file")
- There are also functions for reading other file formats, for example read.csv and read.csv2 for comma- and semicolon-separated files.
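To try read.table and scan without an existing file, one can first write a small table to a temporary file. In this sketch the table contents, the use of tempfile and write.table, and the variable names are all assumptions made for illustration.

```r
# A small made-up table, written to a temporary file
tab = data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))
path = tempfile(fileext = ".txt")
write.table(tab, file = path, row.names = FALSE)

# read.table returns a data frame; header = TRUE uses the first line as names
d = read.table(path, header = TRUE)
d

# scan reads the values into one plain vector, row by row;
# skip = 1 jumps over the header line
numbers = scan(file = path, skip = 1, quiet = TRUE)
numbers

unlink(path)   # remove the temporary file
```

read.table keeps the column structure, while scan flattens everything into a single vector, so the choice depends on what the file contains.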
10 Built in data
- R has numerous built in datasets. You can view them using
- data()
- You can pick one of them and play with it. It is always a good idea to have a look at what kind of data you are working with. There are also help pages for R datasets
- data(DNase)
- ?DNase
- It will print information about DNase.
- You can list all available data sets using
- data(package = .packages(all.available = TRUE))
- To take a data set from another package you can load the corresponding library using
- library(name of library)
- and then you can read the data set. This command will also load all functions in that library.
- Once you have data you can start analysing them
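A quick first look at a built in data set might go as follows. This is only a sketch; str and head are additional commands not mentioned on the slide, and the name available is introduced here.

```r
data(DNase)        # load the data set into the workspace

str(DNase)         # compact overview: column names, types, first values
head(DNase)        # first six rows

# Names of the data sets that ship in the standard "datasets" package
available = data(package = "datasets")$results[, "Item"]
head(available)
```

For a data set from another package, library(name of library) would come first, followed by the same data(...) call.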
11 Simple statistics
- The simplest statistics you can use are the mean, variance and standard deviation
- data(randu)
- mean(randu[,2])
- var(randu[,2])
- sd(randu[,2])
- will calculate the mean, variance and standard deviation of column 2 of the data randu
- Another useful command is
- summary(randu[,2])
- It gives the minimum, 1st quartile, median, mean, 3rd quartile and maximum values
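The three statistics are linked by simple formulas, which the sketch below checks numerically on column 2 of randu. The variable names are introduced here for illustration.

```r
data(randu)
y = randu[, 2]

m  = mean(y)
s2 = var(y)      # sample variance, with divisor n - 1
s  = sd(y)

# sd is the square root of var
all.equal(s, sqrt(s2))

# var written out by hand: sum of squared deviations over (n - 1)
by_hand = sum((y - m)^2) / (length(y) - 1)
all.equal(s2, by_hand)

summary(y)       # min, 1st quartile, median, mean, 3rd quartile, max
```

Note that var uses the divisor n - 1, not n.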
12 Simple two sample statistics
- Covariance between two samples
- cov(randu[,1],randu[,2])
- Correlation between two samples
- cor(randu[,1],randu[,2])
- When you have a matrix (columns are variables and rows are observations)
- cov(randu)
- will calculate the covariances between columns
- cor(randu)
- will calculate the correlations between columns
- If instead rows are variables and columns are observations, you can use the transpose of the matrix
- cov(t(randu))
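The two-sample and matrix forms of cov and cor give consistent answers, as this sketch on randu shows. The names x, y, r and C are introduced here for illustration.

```r
data(randu)
x = randu[, 1]
y = randu[, 2]

# Correlation is covariance scaled by the two standard deviations
r = cov(x, y) / (sd(x) * sd(y))
all.equal(r, cor(x, y))

# Matrix form: one entry per pair of columns
C = cov(randu)
dim(C)                          # 3 x 3, since randu has three columns
all.equal(C[1, 2], cov(x, y))   # off-diagonal entry = pairwise covariance
all.equal(unname(diag(C)), unname(sapply(randu, var)))  # diagonal = variances
```

The covariance matrix is symmetric, with the variances of the individual columns on its diagonal.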
13 Simple plots
- There are several useful plot functions. We will learn some of them during the course. Here are the simplest ones
- plot(randu[,2])
- Plots values vs indices. The x axis is the index of a data point and the y axis is its value
14 Simple plots: boxplot
- Another useful plot is the boxplot.
- boxplot(randu[,2])
- It produces a boxplot. It is a useful plot that may show extreme outliers and the overall behaviour of the data under consideration. It plots the median, the 1st and 3rd quartiles, and the minimum and maximum values (excluding any points flagged as outliers). In some sense it is a graphical representation of the command summary
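The numbers behind a boxplot can be inspected without drawing it. This sketch assumes you only want the statistics; removing plot = FALSE draws the usual figure. The name b is introduced here for illustration.

```r
data(randu)
y = randu[, 2]

# plot = FALSE returns the statistics instead of drawing
b = boxplot(y, plot = FALSE)

b$stats     # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out       # points drawn individually as outliers (may be empty)

fivenum(y)  # Tukey's five-number summary, for comparison
```

The box edges are Tukey's hinges, which can differ slightly from the quartiles reported by summary.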
15 Simple plots: histogram
- Histogram is another useful command. It may give some idea about the underlying distribution
- hist(randu[,2])
- will plot a histogram. The x axis is the value of the data and the y axis is the number of occurrences
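As with the boxplot, hist can return its bins and counts instead of drawing. In this sketch the choice breaks = 10 is an arbitrary suggestion to R, not a value from the slide.

```r
data(randu)
y = randu[, 2]

# plot = FALSE returns the histogram object without drawing it
h = hist(y, breaks = 10, plot = FALSE)

h$breaks                     # bin boundaries on the x axis
h$counts                     # number of occurrences in each bin (y axis)
sum(h$counts) == length(y)   # every data point falls in exactly one bin
```

For data from a uniform distribution the counts should be roughly equal across bins.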
16 Simple plots: qqplot
- A useful way of checking if data obey a particular distribution
- qqnorm(randu[,2])
- is useful to see if the distribution is normal. If it is, the plot should be approximately linear. For these data it clearly is not, so the data are not normal
17 Simple qqplot
- Let us test another one: the uniform distribution
- qqplot(randu[,2],runif(1000))
- runif is a random number generator for the uniform distribution. It is a useful command.
- The result looks much better
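How straight each Q-Q plot is can be summarised roughly by the correlation of its plotted points. This is only an informal sketch, not a formal test; the seed, the variable names and the use of cor as a linearity measure are assumptions made here.

```r
data(randu)
y = randu[, 2]

set.seed(1)            # arbitrary seed so the run is reproducible
u = runif(1000)        # reference sample from the uniform distribution

# plot.it = FALSE returns the plotted coordinates without drawing
qn = qqnorm(y, plot.it = FALSE)      # data vs normal quantiles
qu = qqplot(y, u, plot.it = FALSE)   # data vs uniform sample

cor_norm = cor(qn$x, qn$y)
cor_unif = cor(qu$x, qu$y)
c(normal = cor_norm, uniform = cor_unif)
```

The value for the uniform comparison should be closer to 1, matching the visual impression that the second plot is nearer to a straight line.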
18 Further reading
- "An Introduction to R", distributed with R
- Dalgaard, P. Introductory Statistics with R