Lab 2: Estimating Probabilities and Exploratory Statistics
IF YOU NEED HELP: Review the following
Logical operators (R Handbook)
Indexing of data frames/matrices/vectors (R Handbook)
Setting:
As an environmental consultant working on development planning for the eastern United States, you have successfully been able to locate and begin to explore data on sea duck wintering areas (Lab 1!). Now that you have explored and mapped some of the raw data, you are ready to begin asking some more detailed questions of the dataset.
In particular, you are interested in exploring some characteristics of the different species of ducks, as well as understanding the probabilities of duck flock sizes relative to given critical thresholds. You’re going to be working with probabilities and descriptive statistics.
Lab Purpose:
In lecture we will discuss definitions, axioms, and theorems of probability; now let’s apply them to our duck data sets. Recall that probability quantifies the likelihood that an event will occur, so our first task is to define some simple events, each defined for a given observation. Let’s focus on something of concern for coastal development: duck wintering sites on shore vs. off shore.
Are ducks found on shore (0 km from coast)?
Are ducks found off shore?
We might ask if these are connected to particular species. To do so, it may be easiest to first separate species into particular nominal categories.
Black scoter (Melanitta americana, code BLSC), a near threatened species
American common eider (Somateria mollissima dresseri, code COEI), a near threatened species
Long-tailed duck (Clangula hyemalis, code LTDU), a vulnerable species
Surf scoter (Melanitta perspicillata, code SUSC), a species of least concern for conservation
White-winged scoter (Melanitta fusca, code WWSC), a species of least concern for conservation
Unidentified dark-winged scoter (surf or black scoter, code DWSC)
Unidentified scoter (Melanitta sp., code SCOT)
The “codes” correspond to the “species” column in your duck dataset.
For this lab, we’ll begin to explore species at risk designations for development planning by assessing the probability of finding species at risk, both over the full dataset and conditional on i) flock size and ii) year.
How?
We can use logical operators to find observations that meet certain conditions, then mark them as belonging to a certain event. For example, suppose I wanted to identify flocks at or above sea level (depth <= 0 m, where negative values indicate height above water), and flocks over deep water (depth > 20 m). I could set up three events:
depth <= 0 m
depth > 0 m AND depth <= 20 m
depth > 20 m
evnts <- data$depth * NA        # Creates a new (empty) vector,
                                # the same size as my depth vector
evnts[data$depth <= 0] <- 1     # Where depth <= 0, mark evnts
                                # with 1
evnts[data$depth > 0 &
      data$depth <= 20] <- 2    # Where depth is between 0 and 20,
                                # mark evnts with 2
evnts[data$depth > 20] <- 3     # Where depth > 20, mark evnts
                                # with 3
You can also use which; it adds a line to each step, but would accomplish the same thing:
i <- which(data$depth <= 0)     # Find entries where depth <= 0;
                                # save them to object ‘i’
evnts[i] <- 1                   # Mark the same entries in evnts
                                # with ‘1’.
You can also use which to list the positions of the elements in your vector “evnts” that match a given event. HINT: this could help you count the number of times a given event occurs.
which(evnts== 1)
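As a minimal sketch of how that count could be obtained, the example below uses a small made-up depth vector (not the lab data; the names depth, n1, n1_alt, and counts are illustrative). It shows three ways to count events that should agree with one another.

```r
# Toy depth values, made up for illustration only (not the lab data)
depth <- c(-2, 0, 5, 12, 20, 25, 40, 3)

evnts <- depth * NA                     # empty vector, same length as depth
evnts[depth <= 0] <- 1                  # at or above sea level
evnts[depth > 0 & depth <= 20] <- 2     # between 0 and 20 m
evnts[depth > 20] <- 3                  # deeper than 20 m

n1     <- length(which(evnts == 1))     # count event 1 via which()
n1_alt <- sum(evnts == 1)               # same count: each TRUE adds 1
counts <- table(evnts)                  # counts for every event at once
```

If evnts could still contain NA entries, sum(evnts == 1, na.rm = TRUE) would avoid an NA result.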
You’ll need to think a little about how to identify species at risk and flock size events: what combination of logical operators (<, >, <=, >=, ==, != …) and ways to connect statements (like ‘&’ or ‘|’) will pull out the set you want?
You’ll also need to think about how to ‘point’ to certain entries in a matrix, vector, or data frame!
That might mean ‘indexing’ vectors/matrices, with [ ] and commas (where appropriate). It might mean correctly naming columns (e.g. data$flock_size).
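The indexing forms mentioned above might look like the following sketch. The data frame ducks is hypothetical and made up for illustration; only its column names echo those mentioned in this lab.

```r
# Hypothetical mini data frame; values are invented for illustration
ducks <- data.frame(
  species    = c("BLSC", "COEI", "SUSC", "LTDU"),
  depth      = c(0, 15, 30, -1),
  flock_size = c(12, 300, 45, 7)
)

ducks$flock_size                 # a whole column, by name
ducks[2, ]                       # row 2, all columns (note the comma)
ducks[, "depth"]                 # all rows, one column
ducks[ducks$depth <= 0, ]        # only rows that meet a condition

# One column, filtered by a condition on another column
coei_flocks <- ducks$flock_size[ducks$species == "COEI"]

# Keep the matching rows as a new object
on_shore <- ducks[ducks$depth <= 0, ]
```

In a data frame the position before the comma selects rows and the position after it selects columns; a plain vector takes a single index with no comma.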
You’ll need to think about creating new vectors or variables to hold your information. I often like to create empty objects by copying existing data; this way, the new object automatically has the right dimensions (e.g. evnts <- data$depth*NA creates ‘evnts’ from the “depth” column of data; multiplying by NA sets all entries to NA).
You’ll need to think about how to estimate the probability of events, conditional probability, etc. THIS IS JUST DOING MATH IN R!
$$\Pr\{E\} \approx \frac{\text{Number of times } E \text{ occurs}}{\text{Number of opportunities for } E \text{ to occur}} = \frac{a}{n} \qquad \text{(Eq. 1)}$$
Where a is the number of times E was observed, and n is the number of observations.
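As a hedged illustration of Eq. 1, the sketch below estimates a probability and a conditional probability from a few made-up observations. The vectors species and depth and all object names are hypothetical, not the lab data.

```r
# Made-up observations, for illustration only
species <- c("BLSC", "BLSC", "SUSC", "COEI", "SUSC", "BLSC", "COEI", "SUSC")
depth   <- c(0, 8, 25, -1, 0, 30, 12, 22)

on_shore <- depth <= 0            # TRUE wherever event E occurs

# Eq. 1: a / n
a <- sum(on_shore)                # number of times E was observed
n <- length(on_shore)             # number of observations
p_on_shore <- a / n

# Conditional probability Pr{E | BLSC}: count E only among the
# observations that satisfy the condition
is_blsc <- species == "BLSC"
p_on_shore_given_blsc <- sum(on_shore & is_blsc) / sum(is_blsc)
```

The only difference between the two estimates is the denominator: all observations in the first case, only those meeting the condition in the second.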
Descriptive statistics
It is usually useful to examine the distribution of your samples, and to use simple descriptive statistics to understand basic features of the data. For example, if you find that most species at risk have small flock sizes using the probability calculations above, plotting the distribution may reveal a long “tail” in which there are a few cases of very, very large flock sizes. Plotting data can help you understand the location, spread, and symmetry of your data, as well as assess the robustness and resistance of the statistics used to quantify these characteristics.
One way to test the resistance and robustness of a statistical measure is to calculate it repeatedly with subsamples of a larger data set. If a measure is robust and resistant, it shouldn’t vary too much with each re-calculation, even as your subsamples get small (and your estimates of the various measures become less certain). This kind of repeated sampling has several useful applications in stats, notably estimating a measurement’s uncertainty. We’ll explore them further in later labs.
It’s relatively easy to create a subsample in R: just use the sample function. Given a vector (x) and a subsample size (n), sample(x, n) will pull n entries from x at random. By default, sample will not pull the same entry more than once, but you can request that it does by setting replace = TRUE.
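A minimal sketch of subsampling, assuming a stand-in vector x in place of a real data column. The seed value and all object names are arbitrary choices for illustration.

```r
set.seed(42)                      # arbitrary seed so the draws are repeatable
x <- 1:1000                       # stand-in for a data column

sub_a <- sample(x, 100)                   # 100 entries, no repeats (default)
sub_b <- sample(x, 100, replace = TRUE)   # 100 entries, repeats allowed

# The same statistic on three separate subsamples, stored in one vector
sub_means <- c(mean(sample(x, 100)),
               mean(sample(x, 100)),
               mean(sample(x, 100)))
range(sub_means)                  # smallest and largest of the three means
```

A narrow range across subsamples suggests the statistic is not very sensitive to which observations happen to be drawn.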
Useful functions:
sum(x) Adds all values in the numerical object x
length(x) Gives the length of the vector x (NOTE: x MUST be a vector)
+, -, *, / Math operations (addition, subtraction, multiplication, division)
sample(x, n) draws a random subsample of n entries from the vector x
table(x) creates a table counting all the occurrences of specific entries
mean(x) calculates the mean of a vector
median(x) finds the median of a vector
sd(x) calculates the standard deviation of a vector
IQR(x) calculates the interquartile range of a vector
skewness(x) calculates the skewness of a vector. You need to source the “num.sum.funcs.R” script to run this
YKi(x) calculates the Yule-Kendall Index of a vector. You need to source the “num.sum.funcs.R” script to run this
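The contents of “num.sum.funcs.R” are not shown here, so the sketch below uses hypothetical stand-ins (skew_sketch and yk_sketch) built from common textbook definitions: a moment-based sample skewness, and the Yule-Kendall index computed from the quartiles. The course functions skewness and YKi may be defined differently.

```r
# Hypothetical stand-ins; the course script may use other definitions
skew_sketch <- function(x) {
  # third central moment (n - 1 divisor) over the cubed standard deviation
  (sum((x - mean(x))^3) / (length(x) - 1)) / sd(x)^3
}

yk_sketch <- function(x) {
  # (Q1 - 2 * median + Q3) / IQR, using R's default quantile method
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  (q[1] - 2 * q[2] + q[3]) / (q[3] - q[1])
}

x <- c(1, 2, 2, 3, 3, 3, 4, 50)   # toy data with one extreme value

loc_stats    <- c(mean = mean(x), median = median(x))       # location
spread_stats <- c(sd = sd(x), IQR = IQR(x))                 # spread
sym_stats    <- c(skew = skew_sketch(x), YK = yk_sketch(x)) # symmetry
```

In this toy vector the single extreme value pulls the mean well above the median, which is the kind of contrast between resistant and non-resistant measures that the lab asks about.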
Grading:
30% for a script that i) runs and ii) completes all required tasks.
5% deducted for each line of script that produces a crash/error message.
15% for including comments that make it easy to understand how the script works
15% for formatting your script in the requested manner.
Should be named appropriately
Must be easy to read for your TA
40% for your written interpretations
Tasks
Submit an R script that does the following:
Loads your data set
For each observation, assigns one of the following categories:
Depth <= 0 m; flock on shore; call this S1
Depth > 0 m and Depth <= 20 m; flock near shore; call this S2
Depth > 20 m; flock off shore; call this S3
Calculates the probability that flocks are found: a) on shore, b) near shore, c) off shore. These probabilities will simply be the number of flocks found at a given depth, divided by the number of observations we have for each species (Pr{S1}, Pr{S2},Pr{S3})
Applies the custom specsplit function to create a list of seven datasets for each duck species (or species grouping)
Calculates the conditional probability of finding a flock on shore given that it is each one of the species (or species groups). Hint: you will calculate 7 different conditional probabilities here (Pr{S1|species})
Calculates the conditional probability of finding a species at risk (near threatened or vulnerable) given it is found on shore (Pr{at risk|S1})
Reports (prints) a paragraph comparing the probability of finding flocks on shore by species, versus the likelihood of finding a species at risk if looking on shore.
Proves that Bayes’ Theorem applies to your duck data set; that is, shows that:
$$\Pr\{\text{at risk} \mid S1\} = \frac{\Pr\{S1 \mid \text{at risk}\}\,\Pr\{\text{at risk}\}}{\Pr\{S1\}}$$
Plots a histogram of the depths of all duck flocks
Calculates (and reports) the following descriptive statistics for the depth data for all duck flocks. HINT: You will need to source the num.sum.funcs.R script for some of these calculations:
Mean
Median
Standard deviation
IQR
Skewness
Yule-Kendall Index
Creates three random subsamples of depth of all duck flocks, storing them as objects. Each should have a length of 100 (n=100)
Calculates the same set of descriptive statistics for each subsample:
Mean
Median
Standard deviation
IQR
Skewness
Yule-Kendall Index
For each statistic, save results for your 3 subsamples as a new (3 element) vector. So you’ll have a vector with your 3 subsample means, another for your 3 subsample medians, etc.
Reports the RANGE of your subsampled statistics, and prints comments on the relative i) agreement and ii) resistance of:
Location measures (a-b)
Spread measures (c-d)
Symmetry measures (e-f)
Prints a paragraph, commenting on the uncertainty of the statistics calculated as part of question 10
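The Bayes’ Theorem check in the task list can be illustrated with a small numerical sketch. The data below are made up, and the object names are hypothetical. The at-risk codes follow the species list earlier in this lab (near threatened or vulnerable: BLSC, COEI, LTDU).

```r
# Made-up observations, for illustration only (not the lab data)
species <- c("BLSC", "COEI", "SUSC", "LTDU", "WWSC",
             "BLSC", "SUSC", "COEI", "WWSC", "LTDU")
depth   <- c(0, -1, 0, 14, 25, 30, 0, 8, 0, 22)

at_risk <- species %in% c("BLSC", "COEI", "LTDU")  # near threatened or vulnerable
s1      <- depth <= 0                              # event S1: flock on shore

p_s1   <- mean(s1)          # the mean of a logical vector is the share of TRUEs
p_risk <- mean(at_risk)

p_s1_given_risk <- sum(s1 & at_risk) / sum(at_risk)
p_risk_given_s1 <- sum(s1 & at_risk) / sum(s1)     # left-hand side, counted directly

bayes_rhs <- p_s1_given_risk * p_risk / p_s1       # right-hand side of Bayes' Theorem

bayes_holds <- isTRUE(all.equal(p_risk_given_s1, bayes_rhs))
```

all.equal is used rather than == because the two sides are computed through different floating-point operations and may differ by a tiny rounding error.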