EDA-Lesson 2-3-4

  • Convenience sampling is a non-probability sampling method where the researcher selects the most readily available individuals to be included in the sample.
  • Statistics is the science of data, dealing with collection, presentation, analysis, and use of data to make decisions, solve problems, and design products and processes
  • Branches of Statistics:
    • Descriptive Statistics: describes characteristics and properties of a group, based on easily verifiable facts, does not draw inferences
    • Inferential Statistics: draws inferences about a population based on data gathered from samples, leading to predictions about larger data sets
  • Population:
    • Totality of all observations from which the data set is acquired
    • All possible events should be considered
    • A numerical measure describing the population is a parameter
  • Sample:
    • A subset of observations taken from the population
    • Should be a heterogeneous group that represents the population
    • A numerical measure describing a sample is a statistic
  • Variables in statistics:
    • Qualitative Variables: categorical data answered by non-numeric data
    • Quantitative Variables: numerical data that are countable or measurable quantities
  • Categories of Quantitative Data:
    • Continuous Data: measurable quantities that can take infinitely many values within any interval
    • Discrete Data: countable quantities that take only distinct, separate values
  • Dependent vs Independent Variable:
    • Independent Variable: factor that is altered or varied in order to observe its effect
    • Dependent Variable: outcome observed when changes are applied to the independent variable
    • Controlled Variable: kept constant so that it does not influence the result
    • Extraneous Variable: not part of the study but may still affect the result, ideally only minimally
  • Scales of Measurement:
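    • Nominal: labels or numbers that only name categories, with no order
    • Ordinal: categories that can be ranked, but with no fixed difference between ranks
    • Interval: numeric data with constant differences between values but no true zero
    • Ratio: numeric data with constant differences and a true zero, so ratios are meaningful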
  • Sampling:
    • Process of taking samples from the population
    • Probability Sampling: every member of the population has a known, nonzero chance of selection, which reduces bias against certain events
    • Simple Random Sampling: selecting elements purely at random so that each has an equal chance of being chosen
    • Systematic Sampling: arranging population in order and selecting every kth element
  • Sampling (cont.):
    • Stratified Sampling: grouping the population into strata and performing random sampling within each stratum
    • Cluster Sampling: dividing the population into clusters, each with heterogeneous characteristics, and randomly selecting whole clusters as the sample
    • Non-Probability Sampling: selection is not random, so some individuals are certain to be selected while others have no chance
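The four probability sampling methods above can be compared side by side in a short sketch. The population of 100 ID numbers, the two strata, the cluster size of 20, and the fixed seed are all assumptions made for illustration; only Python's standard `random` module is used.

```python
import random

rng = random.Random(42)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))   # hypothetical population of 100 ID numbers

# Simple random sampling: every element has the same chance of selection.
simple = rng.sample(population, 10)

# Systematic sampling: random starting point, then every k-th element.
k = len(population) // 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: split into strata, then sample randomly within each.
strata = {"low": population[:50], "high": population[50:]}
stratified = [x for group in strata.values() for x in rng.sample(group, 5)]

# Cluster sampling: split into clusters, then randomly pick a whole cluster.
clusters = [population[i:i + 20] for i in range(0, len(population), 20)]
cluster_sample = rng.choice(clusters)

print("simple:    ", sorted(simple))
print("systematic:", systematic)
print("stratified:", sorted(stratified))
print("cluster:   ", cluster_sample)
```

Note that the stratified sample always contains five values from each stratum, while the simple random sample gives no such guarantee.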
  • Data Presentation:
    • Textual Form: presentation using sentences and paragraphs
    • Tabular Form: presentation using tables
    • Graphical Form: pictorial representation
  • Data Presentation (cont.):
    • Ungrouped Data: data points treated individually
    • Grouped Data: data points treated and grouped according to categories
    • Stem and Leaf Diagram: data split into "stem" and "leaf"
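As a minimal sketch of a stem and leaf diagram, the function below splits each value into a stem (all digits except the last) and a leaf (the last digit). The function name `stem_and_leaf` and the scores are hypothetical, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative integers."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)   # stem = leading digits, leaf = last digit
    return dict(plot)

data = [23, 25, 31, 38, 38, 42, 47, 51]   # hypothetical scores
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```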
  • Frequency Distribution Table:
    • Class limits: smallest and largest values within the class interval
    • Class boundaries: more precise expression of the class interval
  • A class boundary is obtained as the midpoint between the upper limit of one class and the lower limit of the next class
  • Frequency: The number of observations falling within a particular class
  • Class width (class size): Numerical difference between the upper and lower class boundaries of a class interval
  • Class mark (class midpoint): The midpoint of the class, computed as the average of its lower and upper limits, usually symbolized by x
  • Cumulative Frequency Distribution: Derived from the frequency distribution by adding the class frequencies or partial sums
  • Types of Cumulative Frequency Distribution:
    • Less than cumulative frequency (<cf): The number of observations less than the upper class boundary of the class it corresponds to
    • Greater than cumulative frequency (>cf): The number of observations greater than the lower class boundary of the class it corresponds to
  • Relative Frequency: Percentage frequency of the class with respect to the total number of observations, often used when presenting pie charts
  • Relative Frequency Distribution: The proportion in percent of the frequency of each class to the total frequency, obtained by dividing the class frequency by the total frequency and multiplying by 100
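The columns of a frequency distribution table defined above can be computed as in the following sketch. The class limits and frequencies are made-up values for illustration, and the variable names are not from the lesson.

```python
from itertools import accumulate

# Hypothetical class limits and frequencies, for illustration only.
limits = [(10, 19), (20, 29), (30, 39), (40, 49)]
freqs = [3, 7, 6, 4]

# A boundary is the midpoint between one class's upper limit
# and the next class's lower limit.
gap = limits[1][0] - limits[0][1]
boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in limits]

widths = [upper - lower for lower, upper in boundaries]   # class width
marks = [(lo + hi) / 2 for lo, hi in limits]              # class mark x

less_cf = list(accumulate(freqs))                 # <cf, running total from the top
greater_cf = list(accumulate(freqs[::-1]))[::-1]  # >cf, running total from the bottom

total = sum(freqs)
rel_freq = [100 * f / total for f in freqs]       # relative frequency in percent

for row in zip(limits, boundaries, marks, freqs, less_cf, greater_cf, rel_freq):
    print(row)
```

The last <cf value and the first >cf value both equal the total frequency, which is a quick check on the arithmetic.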
  • Steps in Constructing a Frequency Distribution Table:
    1. Get the lowest and highest value in the distribution
    2. Get the value of the range
    3. Determine the number of classes using Sturges' Formula or the Square Root Principle
    4. Determine the size of the class interval
    5. Construct the classes
    6. Determine the frequency of each class by counting the number of items in each interval
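The six steps above can be sketched as follows. This assumes whole-number data, Sturges' Formula in the common form k = 1 + 3.322 log10(n), and rounding both k and the class size up; textbooks differ on the rounding convention, so treat those choices as assumptions. The function name `frequency_table` and the data are hypothetical.

```python
import math

def frequency_table(data):
    """Return a list of (lower limit, upper limit, frequency) rows."""
    lowest, highest = min(data), max(data)              # step 1
    value_range = highest - lowest                      # step 2
    k = math.ceil(1 + 3.322 * math.log10(len(data)))    # step 3 (Sturges, rounded up)
    size = max(1, math.ceil(value_range / k))           # step 4 (rounded up, at least 1)

    classes = []                                        # step 5
    lower = lowest
    while lower <= highest:
        classes.append((lower, lower + size - 1))
        lower += size

    # step 6: count the items falling in each class
    return [(lo, hi, sum(lo <= x <= hi for x in data)) for lo, hi in classes]

# Hypothetical data set of 20 whole-number observations.
data = [12, 15, 17, 21, 22, 22, 25, 28, 30, 31,
        33, 35, 38, 40, 41, 44, 47, 49, 52, 55]
for lo, hi, f in frequency_table(data):
    print(f"{lo}-{hi}: {f}")
```

Because the class size is rounded up, the number of classes actually built can differ slightly from k for some data sets.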
  • Graphical Form of Frequency Distribution:
    • Frequency Polygon: Line graph with points plotted at the midpoint of the classes
    • Histogram: Bar graph with adjacent bars whose edges are plotted at the exact limits (class boundaries) of the classes
    • Ogive: Line graph representing the cumulative frequency distribution, where the < ogive represents <cf and the > ogive represents >cf
  • Steps to create an ogive graph:
    1. Calculate Cumulative Frequencies
    2. Choose the Scale
    3. Plot Points
    4. Connect Points with line segments to form the curve representing the ogive
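As a sketch of steps 1 and 3, the points of both ogives can be computed before any plotting is done. The class boundaries and frequencies are hypothetical, and the plotting itself is left out; the resulting (x, y) pairs could be passed to any plotting tool.

```python
from itertools import accumulate

# Hypothetical class boundaries (one more than the number of classes)
# and class frequencies.
boundaries = [9.5, 19.5, 29.5, 39.5, 49.5]
freqs = [3, 7, 6, 4]

# Step 1: cumulative frequencies at each boundary.
less_cf = [0] + list(accumulate(freqs))                 # observations below each boundary
greater_cf = list(accumulate(freqs[::-1]))[::-1] + [0]  # observations above each boundary

# Step 3: points to plot, x = class boundary, y = cumulative frequency.
less_ogive = list(zip(boundaries, less_cf))
greater_ogive = list(zip(boundaries, greater_cf))

print("< ogive:", less_ogive)
print("> ogive:", greater_ogive)
```

The < ogive rises from 0 to the total frequency, while the > ogive falls from the total frequency to 0.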
  • Measure of Central Tendency:
    • Mean: Most widely used measure for describing interval or ratio data, calculated by summing values and dividing by the number of values
    • Median: The midpoint of values after they have been ordered from smallest to largest
    • Mode: The value that appears most frequently
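A quick way to check the three measures on ungrouped data is Python's standard `statistics` module. The data set below is hypothetical.

```python
import statistics

data = [2, 4, 4, 5, 7, 8]   # hypothetical ungrouped data

mean = statistics.mean(data)       # sum of values divided by the number of values
median = statistics.median(data)   # middle of the ordered values (average of the two middle ones here)
mode = statistics.mode(data)       # most frequent value

print(mean, median, mode)
```

With an even number of values, the median is the average of the two middle values, which is why it need not be one of the observed data points.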
  • Measure of Variation (Dispersion):
    • Range: The difference between the largest and smallest number in the set
    • Mean Absolute Deviation (MAD): The average of unsigned deviations from the mean
    • Variance: The average of the squared deviations from the mean
    • Standard Deviation: The positive square root of the variance
    • Coefficient of Variation (CV): The ratio of the standard deviation to the mean, expressed as a percentage
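The measures of variation above can be computed directly from their definitions, as in this sketch. The data set is hypothetical, and the variance here divides by n (the population form, matching "the average of squared deviations"); a sample variance would divide by n - 1 instead.

```python
import math

data = [2, 4, 4, 5, 7, 8]   # hypothetical ungrouped data
n = len(data)
mean = sum(data) / n

value_range = max(data) - min(data)              # largest minus smallest
mad = sum(abs(x - mean) for x in data) / n       # average unsigned deviation
variance = sum((x - mean) ** 2 for x in data) / n  # average squared deviation (divides by n)
std_dev = math.sqrt(variance)                    # positive square root of the variance
cv = std_dev / mean * 100                        # standard deviation relative to the mean, in percent

print(value_range, mad, variance, std_dev, cv)
```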
  • Measure of Shape:
    • Skewness: Degree of asymmetry of a distribution about its mean
    • Kurtosis: The degree of peakedness exhibited by the distribution
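The lesson does not state formulas for the measures of shape, so the sketch below assumes the common moment-based coefficients: skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where m_r is the average of the r-th powers of the deviations from the mean. Other definitions (for example, sample-adjusted versions) give slightly different values. The helper `central_moment` and the data are hypothetical.

```python
data = [2, 4, 4, 5, 7, 8]   # hypothetical ungrouped data
n = len(data)
mean = sum(data) / n

def central_moment(r):
    """Average of the r-th powers of deviations from the mean."""
    return sum((x - mean) ** r for x in data) / n

m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)

skewness = m3 / m2 ** 1.5   # 0 for a symmetric distribution, positive if the right tail is longer
kurtosis = m4 / m2 ** 2     # about 3 for a normal distribution under this definition

print(skewness, kurtosis)
```

Under this definition, a positive skewness indicates a longer right tail, and a kurtosis below 3 indicates a flatter distribution than the normal curve.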