Ground truth = the answer as it actually exists in the world. We can never observe it directly because we can't collect ALL the data, so...
We want to use our data to build a model of the world.
Model: a formal representation of a system
Two types of models you could use:
Deterministic
Probabilistic/Stochastic/statistical
Deterministic models:
Deterministic models imply certainty and consistency, but real-world data (especially human subjects data!) are complex. Deterministic models aren't a great fit for the world because they don't account for randomness.
There are many factors that we can't anticipate or account for in our studies.
Probabilistic models:
With inferential statistics, we make sense of the world using probabilistic models, which take the element of randomness into account.
Inferential tests tell you something about the probability of your data, and this helps guide your decision about the ground truth.
What is probability?
The likelihood of an event's occurrence.
2 ways to conceptualise probability:
Analytic definition
Relative frequency
Analytic Definition - the probability of an event is equal to the ratio of successful outcomes to all possible outcomes (assuming every outcome is equally likely).
Relative Frequency - the proportion of times you would observe x if you took an infinite number of samples.
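To make the two definitions concrete, here is a minimal sketch in R using a fair six-sided die (the die, the seed, and the variable names are illustrative assumptions, not part of the notes). The analytic answer is computed by counting outcomes; the relative-frequency answer is approximated by simulating many rolls.

```r
# Analytic definition: P(even number) on a fair six-sided die
outcomes  <- 1:6                              # all possible outcomes
successes <- outcomes[outcomes %% 2 == 0]     # outcomes that count as a success
p_analytic <- length(successes) / length(outcomes)
p_analytic                                    # 0.5

# Relative frequency: approximate the same probability by simulation
set.seed(1)                                   # arbitrary seed, only for reproducibility
rolls <- sample(outcomes, size = 10000, replace = TRUE)
p_relative <- mean(rolls %% 2 == 0)           # proportion of simulated rolls that were even
p_relative                                    # should land close to p_analytic
```

We can't take an infinite number of samples, so the simulated value is only an approximation of the relative-frequency definition.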
The law of large numbers:
Given an event x with probability P(x), over n trials, the probability that the relative frequency of x differs from P(x) by more than any fixed amount approaches 0 as n approaches infinity.
In other words, the more trials you run, the more closely the observed relative frequency represents the ground truth.
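The law of large numbers can be watched in action with a simulated coin. This is a rough sketch: the fair coin (P(heads) = 0.5), the seed, and the checkpoints are assumptions chosen for illustration.

```r
set.seed(42)                                   # arbitrary seed for reproducibility
n <- 10000
flips <- sample(c(0, 1), size = n, replace = TRUE)   # 1 = heads, 0 = tails

# Relative frequency of heads after each trial
running_prop <- cumsum(flips) / seq_len(n)

# Compare the relative frequency at a few sample sizes with P(heads) = 0.5
checkpoints <- c(10, 100, 1000, 10000)
data.frame(trials = checkpoints, prop_heads = running_prop[checkpoints])
```

The early proportions can be far from 0.5, while the later ones tend to settle near it.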
Set:
Well-defined collection of objects; composed of elements or members
Sets:
A) x is an element of set A
B) x is not an element of set A
C) x is an integer ≥ 1 and ≤ 10
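In R, set membership can be checked with the %in% operator. A small sketch, assuming A is the set of integers from 1 to 10 described in C):

```r
A <- 1:10        # the integers x with x >= 1 and x <= 10

3 %in% A         # TRUE: 3 is an element of A
12 %in% A        # FALSE: 12 is not an element of A
```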
Universal set:
All possible elements in a category of interest
Subset:
If B is a subset of A:
All elements of B must also be in A
However, all elements in A do not have to exist in B (although they can)
Complement of A:
Everything that is in the universal set but not in set A.
Written ~A. Its probability is P(~A) = 1 − P(A).
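A complement can be sketched in R with setdiff(). The universal set and A below are made up for illustration, and the probabilities assume every element is equally likely.

```r
universal <- 1:10                  # hypothetical universal set
A <- c(2, 4, 6, 8, 10)             # hypothetical set A (the even numbers)

not_A <- setdiff(universal, A)     # complement of A within the universal set
not_A                              # 1 3 5 7 9

# With equally likely elements, P(A) and P(~A) add up to 1
p_A     <- length(A) / length(universal)
p_not_A <- length(not_A) / length(universal)
p_A + p_not_A                      # 1
```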
Set notation:
Subsets
A) B is a subset of A
B) B is a proper subset of A
C) At least 1 element of A is not a member of B
D) B isn't identical to A
E) A is not a subset of B
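The subset statements above can be checked in R. This is only a sketch: is_subset() is a hypothetical helper written for these notes (base R has no function by that name), and the letters are placeholder elements.

```r
A <- c("a", "b", "c", "d")
B <- c("a", "b")

# Hypothetical helper: x is a subset of y if every element of x is also in y
is_subset <- function(x, y) all(x %in% y)

is_subset(B, A)                          # TRUE: B is a subset of A
is_subset(A, B)                          # FALSE: A is not a subset of B

# B is a proper subset of A if it is a subset and the two sets are not identical
is_subset(B, A) && !setequal(A, B)       # TRUE
```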
Set Operations:
There are also ways we can describe two distinct sets in terms of how they interact with each other.
Union: when an element is a member of either set A or set B (or both)
Intersection: when an element is a member of both set A and set B
Difference: when an element is a member of set A but not set B, or vice versa
Empty Set: the set of no outcomes. When sets A and B are mutually exclusive (when A occurs, B cannot occur), their intersection is the empty set.
Intersection symbol: ∩. P(A ∩ B) is the probability of A and B.
In R, use the intersect() function to find elements that are in both A and B.
Union symbol: ∪. P(A ∪ B) is the probability of A or B.
Difference, how to calculate in R:
setdiff(dogs, cats) gives you the dog names that aren't also cat names. If you reverse the arguments, setdiff(cats, dogs), it gives you the cat names that aren't also dog names.
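Here is a short sketch of the set operations in R. The dogs and cats vectors are made-up pet names for illustration, not the data used in the lab.

```r
dogs <- c("Bella", "Max", "Luna", "Rex")       # made-up dog names
cats <- c("Luna", "Milo", "Bella", "Tiger")    # made-up cat names

union(dogs, cats)       # names in either set: "Bella" "Max" "Luna" "Rex" "Milo" "Tiger"
intersect(dogs, cats)   # names in both sets:  "Bella" "Luna"
setdiff(dogs, cats)     # dog names that aren't cat names: "Max" "Rex"
setdiff(cats, dogs)     # cat names that aren't dog names: "Milo" "Tiger"
```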
Empty sets:
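In R, an empty set shows up as a vector of length zero. A minimal sketch with two made-up, mutually exclusive sets:

```r
evens <- c(2, 4, 6)
odds  <- c(1, 3, 5)

both <- intersect(evens, odds)   # no number is both even and odd
both                             # numeric(0), i.e. the empty set
length(both) == 0                # TRUE
```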
Random experiments:
A procedure that meets certain criteria:
~Can be repeated infinitely under identical conditions
~Outcome depends on chance and can't be predicted in advance
By conducting a random experiment, we can make inferences about the likelihood of each of its outcomes
Sample Space:
All possible outcomes of a random experiment are referred to as the sample space.
Event:
An event is a subset of the outcomes from the sample space.
Simple event:
A simple event, a, refers to a single element in a sample space
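The terms above can be tied together with one roll of a die. This is a sketch under assumed names: sample_space, event_even and simple_event are illustrative, and the seed is arbitrary.

```r
sample_space <- 1:6                                   # all possible outcomes of one roll
event_even   <- sample_space[sample_space %% 2 == 0]  # event: a subset of the sample space (2 4 6)
simple_event <- 3                                     # simple event: a single element

all(event_even %in% sample_space)    # TRUE: every outcome in the event is in the sample space
simple_event %in% sample_space       # TRUE

# Run the random experiment once and check whether the event occurred
set.seed(123)                        # arbitrary seed for reproducibility
outcome <- sample(sample_space, size = 1)
outcome %in% event_even
```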
Visualising probability:
A probability distribution is a mathematical function that describes the probability of each event within the sample space.
Plotting a probability distribution allows you to visualise the likelihood of all possible outcomes.
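As an illustration, the sketch below builds and plots the probability distribution for the sum of two fair dice. The dice example is an assumption made for these notes, and it uses base R's barplot() so that no extra packages are needed.

```r
# Every equally likely combination of two fair dice (36 rows)
rolls  <- expand.grid(die1 = 1:6, die2 = 1:6)
totals <- rolls$die1 + rolls$die2

# Probability of each possible total = combinations giving that total / 36
dist <- table(totals) / nrow(rolls)
dist

# Visualise the likelihood of every possible outcome
barplot(dist,
        xlab = "Sum of two dice",
        ylab = "Probability",
        main = "Probability distribution of the sum of two dice")
```

The probabilities across the whole sample space add up to 1, and the tallest bar is at 7 because more combinations produce a 7 than any other total.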
In statistics, we use probabilistic models to make inferences about our data.
We can also use the setequal() function to tell us if two sets are equal. It returns either TRUE or FALSE.
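A quick sketch of setequal() with made-up colour vectors. Note that it compares the vectors as sets, so order and duplicates are ignored.

```r
a <- c("red", "green", "blue")
b <- c("blue", "red", "green", "red")    # same elements, different order, one duplicate

setequal(a, b)                    # TRUE: same set of elements
identical(a, b)                   # FALSE: the vectors themselves differ
setequal(a, c("red", "green"))    # FALSE: "blue" is missing
```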
Don't ever run install.packages("") inside the R Markdown document; do that in the console.
The AND and OR operators:
& refers to ‘and’. It allows you to specify values that meet all given conditions.
The | operator stands for ‘or’ and allows you to specify values that meet any of the given conditions.
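A small sketch of both operators, using a made-up vector of scores:

```r
scores <- c(12, 45, 50, 67, 88, 93)    # made-up values

# AND: keep values that meet both conditions
scores[scores > 40 & scores < 90]      # 45 50 67 88

# OR: keep values that meet at least one condition
scores[scores < 20 | scores > 90]      # 12 93
```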
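The set.seed() function doesn't reduce randomness; it fixes the starting point of R's random number generator so that the same "random" results are produced every time the code is run.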
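A minimal sketch of what set.seed() does in practice (the seed value 2024 is arbitrary):

```r
set.seed(2024)
first <- sample(1:100, 5)     # five random numbers

set.seed(2024)                # reset to the same seed
second <- sample(1:100, 5)    # the same five numbers again

identical(first, second)      # TRUE
```

Without the second set.seed() call, the two samples would almost certainly differ.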
With bar graphs and histograms you don't need to tell R what goes on the y axis because ggplot automatically uses the count.
In the labs() part of a ggplot() call you can add a title.
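A sketch of a bar graph with a title, assuming the ggplot2 package is already installed. The skittles data frame and its colour column are made up for illustration.

```r
library(ggplot2)

# Made-up data: one row per Skittle
skittles <- data.frame(
  colour = c("red", "green", "red", "yellow", "purple", "red", "green")
)

# Only x is mapped; geom_bar() counts the rows for the y axis
p <- ggplot(skittles, aes(x = colour)) +
  geom_bar() +
  labs(title = "Skittles by colour", x = "Colour", y = "Count")

p
```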
By piping a table of counts into prop.table() we can see the proportion of each Skittles colour present in the sample R has taken.
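A rough sketch of the idea with a simulated bag of Skittles. The colours, sample size, and seed are assumptions, and the call is written nested so that it runs on any R version; the piped form is shown in a comment.

```r
set.seed(10)                                     # arbitrary seed for reproducibility
colours <- c("red", "orange", "yellow", "green", "purple")
skittles_sample <- sample(colours, size = 50, replace = TRUE)

counts <- table(skittles_sample)                 # how many of each colour
props  <- prop.table(counts)                     # counts converted to proportions
props

sum(props)                                       # proportions add up to 1

# Piped equivalent (native pipe, R 4.1 or later):
# table(skittles_sample) |> prop.table()
```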
You can sort a frequency table of a specific column in ascending order of the counts with:
sort(table(table_name$column))
Or, for decreasing order, you can use:
sort(table(table_name$column), decreasing = TRUE)
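For example, with a made-up pets data frame standing in for table_name and species standing in for column:

```r
pets <- data.frame(species = c("dog", "cat", "dog", "fish", "dog", "cat"))

ascending  <- sort(table(pets$species))                      # fish 1, cat 2, dog 3
descending <- sort(table(pets$species), decreasing = TRUE)   # dog 3, cat 2, fish 1

ascending
descending
```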
ifelse(AudienceScore <= 50, "Bad", "Good"). Here the ifelse() function, when used inside mutate(), checks each individual data point and writes "Bad" if the statement is TRUE or "Good" if the statement is FALSE.
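A sketch of the same logic in base R, so it runs without loading any packages. The movies data frame and the Rating column name are made up for illustration; the mutate() version is shown in a comment and assumes dplyr is loaded.

```r
movies <- data.frame(
  Title         = c("A", "B", "C"),        # placeholder titles
  AudienceScore = c(35, 50, 82)            # made-up scores
)

# One label per row: "Bad" where the condition is TRUE, "Good" where it is FALSE
movies$Rating <- ifelse(movies$AudienceScore <= 50, "Bad", "Good")
movies$Rating                              # "Bad" "Bad" "Good"

# dplyr equivalent (assumes library(dplyr) has been run):
# movies <- mutate(movies, Rating = ifelse(AudienceScore <= 50, "Bad", "Good"))
```

Note that a score of exactly 50 is labelled "Bad" because the condition uses <=.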