Lab 2: Estimating Probabilities and Exploratory Statistics

IF YOU NEED HELP: Review the following

Logical operators (R Handbook)

Indexing of data frames/matrices/vectors (R Handbook)

Setting:

As an environmental consultant working on development planning for the eastern United States, you have successfully been able to locate and begin to explore data on sea duck wintering areas (Lab 1!). Now that you have explored and mapped some of the raw data, you are ready to begin asking some more detailed questions of the dataset.

In particular, you are interested in exploring some characteristics of the different duck species and in understanding the probabilities of duck flock sizes relative to given critical thresholds. You’re going to be working with probabilities and descriptive statistics.

Lab Purpose:

In lecture we discuss the definitions, axioms, and theorems of probability – now let’s apply them to our duck data sets. Recall that probability quantifies the likelihood that an event will occur, so our first task is to define some simple events, each defined for a given observation. Let’s focus on something of concern for coastal development: duck wintering sites on shore vs off shore.

Are ducks found on shore (0 km from coast)?

Are ducks found off shore?

We might ask if these are connected to particular species. To do so, it may be easiest to first separate species into particular nominal categories.

Black scoter (Melanitta americana, code BLSC), a near threatened species

American common eider (Somateria mollissima dresseri, code COEI), a near threatened species

Long-tailed duck (Clangula hyemalis, code LTDU), a vulnerable species

Surf scoter (Melanitta perspicillata, code SUSC), a species of least concern for conservation

White-winged scoter (Melanitta fusca, code WWSC), a species of least concern for conservation

Unidentified dark-winged scoter (surf or black scoter, code DWSC)

Unidentified scoter (Melanitta sp., code SCOT)

The “codes” correspond to the “species” column in your duck dataset.

For this lab, we’ll begin to explore species at risk designations for development planning by assessing the probability of finding species at risk – both over the full dataset, and conditional on i) flock size and ii) year.

How?

We can use logical operators to find observations that meet certain conditions – then mark them as belonging to a certain event. For example, suppose I wanted to identify flocks at or above sea level (depth <= 0 m, where negative values indicate height above water), flocks over shallow water (depth > 0 m and <= 20 m), and flocks over deep water (depth > 20 m). I could set up three events:

depth <= 0 m

depth > 0 m AND depth <= 20 m

depth > 20 m

evnts <- data$depth*NA          # Creates a new (empty) vector,
                                # the same size as my depth vector

evnts[data$depth <= 0] <- 1     # Where depth <= 0, mark evnts
                                # with 1

evnts[data$depth > 0 &
      data$depth <= 20] <- 2    # Where depth is between 0 and 20,
                                # mark evnts with 2

evnts[data$depth > 20] <- 3     # Where depth > 20, mark evnts
                                # with 3

You can also use which; it adds a line to each step, but would accomplish the same thing:

i <- which(data$depth <= 0)     # Find entries where depth <= 0;
                                # save them to object 'i'

evnts[i] <- 1                   # Mark the same entries in evnts
                                # with 1

You can also use which to list out the elements in your vector “evnts”. HINT: this could help count the number of a given event

which(evnts== 1)
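As a minimal sketch of the counting hint, assuming a small made-up depth vector (the values and the standalone name depth are illustrative, not taken from the duck data set), here are three equivalent ways to count how often an event occurs:

```r
# Hypothetical depths in metres (illustrative values only)
depth <- c(-2, 0, 5, 12, 20, 25, 40, 3)

# Build the event vector the same way as in the depth example
evnts <- depth * NA
evnts[depth <= 0] <- 1
evnts[depth > 0 & depth <= 20] <- 2
evnts[depth > 20] <- 3

# Three equivalent ways to count occurrences of event 1
n1_which <- length(which(evnts == 1))     # count the matching positions
n1_sum   <- sum(evnts == 1)               # TRUE counts as 1, FALSE as 0
n1_table <- as.integer(table(evnts)["1"]) # table() counts every event at once

print(table(evnts))
```

Note that sum(evnts == 1) returns NA if evnts still contains NA entries, so it is worth checking that every observation was assigned to an event (or passing na.rm = TRUE).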

You’ll need to think a little about how to identify species at risk and flock size events – what combination of logical operations (<, >, <=, >=, ==, !=, …) and ways to connect statements (like ‘&’ or ‘|’) will pull out the set you want?

You’ll also need to think about how to ‘point’ to certain entries in a matrix, vector, or data frame!

That might mean ‘indexing’ vectors/matrices, with [ ] and commas (where appropriate). It might mean correctly naming columns (e.g. data$flock_size).

You’ll need to think about creating new vectors or variables to hold your information. I often like to create empty objects by copying existing data – this way, it automatically has the right dimensions (e.g. evnts <- data$depth*NA creates ‘evnts’ from the “depth” column of data; multiplying by NA sets all entries to NA).

You’ll need to think about how to estimate the probability of events, conditional probability, etc. THIS IS JUST DOING MATH IN R!

Pr{E} ≈ (Number of times E occurs) / (Number of opportunities for E to occur) = a/n (Eq. 1)

Where a is the number of times E was observed, and n is the number of observations.
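As a rough sketch of Eq. 1 in R, the block below estimates a probability and a conditional probability from a tiny invented data frame. The object toy and its values are assumptions for illustration only; the column names simply mirror the species and depth columns mentioned above.

```r
# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  species = c("BLSC", "BLSC", "SUSC", "SUSC", "SUSC", "LTDU"),
  depth   = c(0, 15, -1, 30, 8, 0)
)

on_shore <- toy$depth <= 0        # TRUE where the event E occurs

# Eq. 1: Pr{E} is approximately a / n
a <- sum(on_shore)                # number of times E was observed
n <- length(on_shore)             # number of observations
pr_E <- a / n

# Conditional probability: count only within the conditioning event
is_susc <- toy$species == "SUSC"
pr_E_given_susc <- sum(on_shore & is_susc) / sum(is_susc)

print(c(pr_E = pr_E, pr_E_given_susc = pr_E_given_susc))
```

The only change for the conditional case is the denominator: the "opportunities" are the observations that satisfy the condition, not the whole data set.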

Descriptive statistics

It’s usually useful to look at the distribution of your samples, and at simple descriptive statistics, to understand basic features of the data. For example, if you find that most species at risk have small flock sizes using the probability calculations above, plotting the distribution may reveal a long “tail” whereby there are a few cases of very, very large flock sizes. Plotting data can help you understand the location, spread, and symmetry of your data, as well as assess the robustness and resistance of the statistics used to quantify these characteristics.

One way to test the resistance and robustness of a statistical measure is to calculate it repeatedly with subsamples of a larger data set. If a measure is robust and resistant, it shouldn’t vary too much with each re-calculation, even as your subsamples get small (and your estimates of various measures become less certain). This kind of repeated sampling has several useful applications in stats – notably, estimating a measurement’s uncertainty. We’ll explore them further in later labs.

It’s relatively easy to create a subsample in R: just use the sample function. Given a vector (x) and a subsample size (n), sample(x, n) will pull n entries from x at random. By default, sample will not pull the same entry more than once, but you can request that it does (replace = TRUE).
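A short sketch of sample in use, assuming a stand-in vector x rather than a real data column (the seed value is arbitrary and is only there so the draw can be repeated):

```r
set.seed(42)                      # arbitrary seed, for a repeatable draw
x <- 1:1000                       # hypothetical stand-in for a data column

# 100 entries drawn without replacement (the default)
sub_a <- sample(x, 100)

# 100 entries drawn with replacement: the same entry may appear twice
sub_b <- sample(x, 100, replace = TRUE)

print(length(sub_a))
print(anyDuplicated(sub_a) == 0)  # no repeats when sampling without replacement
```

Calling sample again without resetting the seed gives a different subsample, which is what makes repeated subsampling useful for judging how stable a statistic is.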

Useful functions:

sum(x) Adds all values in the numerical object x

length(x) Gives the length of the vector x (NOTE: x MUST be a vector)

+, -, *, / Math operations (addition, subtraction, multiplication, division)

sample(x, n) draws a random subsample of n entries from the vector x

table(x) creates a table counting all the occurrences of specific entries

mean(x) calculates the mean of a vector

median(x) finds the median of a vector

sd(x) calculates the standard deviation of a vector

IQR(x) calculates the interquartile range of a vector

skewness(x) calculates the skewness of a vector. You need to source the “num.sum.funcs.R” script to run this

YKi(x) calculates the Yule-Kendall Index of a vector. You need to source the “num.sum.funcs.R” script to run this
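To illustrate what "resistant" means for these functions, here is a small sketch on an invented vector with one very large value. The contents of num.sum.funcs.R are not shown in this handout, so yk_sketch below is a hypothetical helper using a common textbook form of the Yule-Kendall index; it is not the course’s YKi function and may differ from it in detail.

```r
# Hypothetical sample with one very large value (illustrative only)
x <- c(2, 3, 3, 4, 5, 5, 6, 250)

loc_mean   <- mean(x)     # location: pulled upward by the large value
loc_median <- median(x)   # location: barely affected by it
spr_sd     <- sd(x)       # spread: inflated by the large value
spr_iqr    <- IQR(x)      # spread: based on quartiles only

# Assumed textbook form of the Yule-Kendall index; NOT the course's YKi()
yk_sketch <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.5, 0.75), names = FALSE)
  (q[1] - 2 * q[2] + q[3]) / (q[3] - q[1])
}

print(c(mean = loc_mean, median = loc_median, sd = spr_sd, IQR = spr_iqr))
print(yk_sketch(x))
```

Because the quartile-based measures ignore how extreme the largest value is, they change very little if 250 is replaced by 25 or by 2500, while the mean and standard deviation change a great deal.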

Grading:

30% for a script that i) runs and ii) completes all required tasks.

5% deducted for each line of script that produces a crash/error message.

15% for including comments that make it easy to understand how the script works

15% for formatting your script in the requested manner.

Should be named appropriately

Must be easy to read for your TA

40% for your written interpretations

Tasks

Submit an R script that does the following:

Loads your data set

For each observation, assigns one of the following categories:

Depth <= 0 m; flock on shore; call this S1

Depth > 0 m and Depth <= 20 m; flock near shore; call this S2

Depth > 20 m; flock off shore; call this S3

Calculates the probability that flocks are found: a) on shore, b) near shore, c) off shore. These probabilities will simply be the number of flocks found at a given depth, divided by the total number of observations (Pr{S1}, Pr{S2}, Pr{S3})

Applies the custom specsplit function to create a list of seven datasets, one for each duck species (or species grouping)

Calculates the conditional probability of finding a flock on shore given that it is each one of the species (or species groups). Hint: you will calculate 7 different conditional probabilities here (Pr{S1|species})

Calculates the conditional probability of finding a species at risk (near threatened or vulnerable) given it is found on shore (Pr{at risk|S1})

Reports (prints) a paragraph comparing the probability of finding flocks on shore by species, versus the likelihood of finding a species at risk if looking on shore.

Proves Bayes’ Theorem applies to your duck data set; that is, show that:

Pr{at risk | S1} = (Pr{S1 | at risk} Pr{at risk}) / Pr{S1}

Plots a histogram of the depths of all duck flocks

Calculates (and reports) the following descriptive statistics for the depth data for all duck flocks. HINT: You will need to source the num.sum.funcs.R script for some of these calculations:

a) Mean

b) Median

c) Standard deviation

d) IQR

e) Skewness

f) Yule-Kendall Index

Creates three random subsamples of depth of all duck flocks, storing them as objects. Each should have a length of 100 (n=100)

Calculates the same set of descriptive statistics for each subsample:

a) Mean

b) Median

c) Standard deviation

d) IQR

e) Skewness

f) Yule-Kendall Index

For each statistic, save results for your 3 subsamples as a new (3 element) vector. So you’ll have a vector with your 3 subsample means, another for your 3 subsample medians, etc.

Reports the RANGE of your subsampled statistics, and prints comments on the relative i) agreement and ii) resistance of:

Location measures (a-b)

Spread measures (c-d)

Symmetry measures (e-f)

Prints a paragraph, commenting on the uncertainty of the statistics calculated as part of question 10
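As a hedged illustration of the Bayes’ Theorem check, the sketch below uses a tiny invented data frame (toy and its values are assumptions, not your duck data). The at-risk codes follow the species list above: BLSC and COEI are near threatened, and LTDU is vulnerable.

```r
# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  species = c("BLSC", "COEI", "LTDU", "SUSC", "SUSC", "WWSC", "BLSC", "SUSC"),
  depth   = c(0, -1, 12, 0, 25, 30, 18, -3)
)

at_risk_codes <- c("BLSC", "COEI", "LTDU")  # near threatened or vulnerable
n <- nrow(toy)

S1      <- toy$depth <= 0                   # on shore
at_risk <- toy$species %in% at_risk_codes

pr_S1            <- sum(S1) / n
pr_risk          <- sum(at_risk) / n
pr_S1_given_risk <- sum(S1 & at_risk) / sum(at_risk)

# Left-hand side: counted directly from the data
pr_risk_given_S1 <- sum(at_risk & S1) / sum(S1)

# Right-hand side: assembled from the three other probabilities
bayes_rhs <- pr_S1_given_risk * pr_risk / pr_S1

print(all.equal(pr_risk_given_S1, bayes_rhs))
```

The comparison uses all.equal rather than == because the two sides are computed through different floating-point operations and may differ by a tiny rounding error.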

