STATISTICS AND PROBABILITY-DATA ANALYSIS

  • Statistics - the study of collecting, organizing, presenting, analyzing, and interpreting data.
  • Areas of Statistics
    Descriptive - summarize the characteristics of a data set
    Inferential - test a hypothesis or assess whether your data is generalizable to the broader population
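A minimal sketch of the descriptive side, using Python's standard `statistics` module. The `scores` list is made-up data for illustration only. Inferential statistics would go a step further and use such a sample to draw conclusions about the wider population.

```python
import statistics

# Hypothetical exam scores (made-up values for illustration).
scores = [72, 85, 90, 66, 85, 78, 94]

# Descriptive statistics summarize the data set itself.
mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value when sorted
mode_score = statistics.mode(scores)      # most frequent value
spread = statistics.stdev(scores)         # sample standard deviation

print(f"mean={mean_score:.2f} median={median_score} "
      f"mode={mode_score} stdev={spread:.2f}")
```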
  • Population – the complete set of individuals or observations of interest; it usually refers to a very large amount of data, where a census or complete enumeration of the whole population would be impractical or impossible.
  • Sample – a subset of the population, usually studied in place of the whole population.
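The relationship between a population and a sample can be sketched as follows. The population here is simulated (10,000 made-up values), and the sample size of 100 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated measurements.
population = [random.gauss(170, 10) for _ in range(10_000)]

# A simple random sample: a subset drawn without replacement.
sample = random.sample(population, k=100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean is only an estimate of the population mean, but it is obtained from 100 values instead of 10,000.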
  • Sources of Data - primary and secondary
  • Primary Data - surveys, interviews, direct observations
    Secondary Data - newspapers, journals, research papers
  • Advantages of primary data
    • The researchers can decide the type of method they will use in collecting the data and how long it will take them to gather the data.
    • The researcher can focus the data collection on specific issues of their research and enable them to collect more accurate information.
    • The researchers would know in detail how the data is gathered and hence, will be able to present original and unbiased data.
  • Disadvantages of primary data
    • Primary data collection consumes a lot of time, effort, and money; beyond making the necessary preparations, the researchers must manage both their time and costs effectively.
    • The researchers will have to collect large volumes of data since they will interact with different people and environments; also, they will need to spend a lot of time checking, analysing and evaluating their findings before using such data.
  • Advantages of Secondary Data
    • Using data from secondary sources is more convenient as it requires less time, effort and cost.
    • Secondary data helps to decide what further research needs to be done.
  • Disadvantages of Secondary Data
    • Secondary data may have transcription errors
    • Data from secondary sources may not meet the user’s specific needs
    • Not all secondary data is readily available or inexpensive
    • The accuracy of the secondary data can be questionable.
  • Methods of Data Collection
    • observation
    • experimentation
    • simulation
    • interviewing
    • panel method
    • mail survey
    • projective techniques
    • sociometry
  • Tools for Data Collection
    • types of tools
    • constructing schedules and questionnaires
    • pilot studies and pre-tests
  • Data Science
    • center - data (especially big data)
    • purpose - obtain information and knowledge
  • Data Science
    • The center of data science is data, especially Big Data.
    • The purpose of data science is to obtain information or knowledge from the data that will help in making better decisions and understanding the development and change of nature or society better.
    • Data science is a multidisciplinary field that has applied theories and technologies from several disciplines.
  • Data Scientist - one part mathematician, one part computer scientist, and one part trend-spotter, because their duties are to collect large amounts of unruly data and organize it for various forms of consumption, from spotting trends to predicting outcomes, or even visualizing information so that it can be easily read.
  • Data science involves the collection, organization, analysis and visualization of large amounts of data. Statisticians, meanwhile, use mathematical models to quantify relationships between variables and outcomes and make predictions based on those relationships.
  • Tools of the Trade - Data Science
    • Collect large amounts of messy data and transform it into a more usable format.
    • Solve business-related problems using data-driven techniques.
    • Work with a variety of programming languages (SAS, R, Python, etc.).
    • Have a solid grasp of statistics, including tests and distributions.
    • Learn analytical techniques such as machine learning, deep learning, and text analytics.
    • Look for order and patterns in data and spot trends that can help a business’ bottom line.
    • Communicate and collaborate with both IT and business.
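As a rough illustration of the first item in the list above (turning messy data into a more usable format), here is a small sketch in plain Python. The `raw_records` strings and the `clean` helper are hypothetical, invented for this example.

```python
# Hypothetical messy input: inconsistent spacing, casing, and missing values.
raw_records = ["  Alice , 34", "BOB,  n/a", "carol,29 ", ""]

def clean(records):
    """Turn 'name, age' strings into uniform dictionaries."""
    cleaned = []
    for line in records:
        if not line.strip():
            continue  # skip empty lines
        name, _, age_text = line.partition(",")
        age_text = age_text.strip()
        # Keep the age only when it is a plain whole number.
        age = int(age_text) if age_text.isdigit() else None
        cleaned.append({"name": name.strip().title(), "age": age})
    return cleaned

for record in clean(raw_records):
    print(record)
```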
  • Programming Languages for Data Science (R and Python are open source; SAS is commercial)
    • R
    • SAS Language
    • Python
  • R - a language and environment for statistical computing and graphics. It is a GNU project similar to the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies). It provides a wide variety of statistical and graphical techniques, is free software, and is highly extensible (Foundation, n.d.). It compiles and runs on a wide variety of operating systems.
  • Python - is an object-oriented, interpreted, and interactive programming language developed by Guido van Rossum. It combines remarkable power with very clear syntax and is compatible with other programming languages depending on the user’s preferences (Holden, 2018).
  • SAS language - is a programming language developed by Anthony James Barr as a statistical analysis tool. It is the leading tool in the commercial analytics space, offering a variety of functions and a good user interface that can be easily learned. However, it is the most expensive of the three languages (Jain, 2017).
  • Probability - the chance that a particular event will occur
  • Terms in Probability
    • Probability – is a field of mathematics that deals with chances.
    • Experiment – is an activity in which the results cannot be predicted with certainty. Each repetition of an experiment is called a trial.
  • Terms in Probability
    Outcome- is a result of an experiment.
    Event- is any collection of outcomes; a simple event is an event with only one possible outcome.
    Sample Space- A set of all the possible outcomes of a random experiment. Represented by symbol S.
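The terms above can be made concrete with a short sketch. The experiment chosen here (rolling two six-sided dice) is an assumption for illustration: each outcome is a pair of faces, the sample space S is the set of all such pairs, and an event is a subset of S.

```python
from itertools import product

# Sample space S: every possible outcome of rolling two dice.
sample_space = set(product(range(1, 7), repeat=2))

# Event: a collection of outcomes, here "the two faces sum to 7".
event_sum_7 = {outcome for outcome in sample_space if sum(outcome) == 7}

# Simple event: an event containing exactly one outcome.
simple_event = {(1, 1)}

print(len(sample_space))    # number of outcomes in S
print(sorted(event_sum_7))  # outcomes that make up the event
```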
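  • Classical Method - The classical method of determining probability is used if all of the possible outcomes are known in advance and all outcomes are equally likely. The probability of an event is then the number of outcomes in the event divided by the total number of possible outcomes.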
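A minimal sketch of the classical method, assuming a fair six-sided die so that every outcome is equally likely. The helper name `classical_probability` is hypothetical; it simply counts favourable outcomes and divides by the size of the sample space.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(E) = |E| / |S|, valid only when all outcomes are equally likely."""
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

die = {1, 2, 3, 4, 5, 6}  # sample space for one roll
even = {2, 4, 6}          # event: the roll is even

p_even = classical_probability(even, die)
print(p_even)  # 1/2
```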
  • Frequentist Probability - It defines an event's probability as the limit of its relative frequency in many trials. Probabilities can be found by a repeatable objective process.
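The frequentist idea can be sketched with a simulation. The example assumes a fair coin simulated with Python's `random` module; `relative_frequency` is a hypothetical helper name. As the number of trials grows, the relative frequency of heads tends to settle near 0.5.

```python
import random

def relative_frequency(trials, seed=0):
    """Fraction of simulated fair-coin flips that come up heads."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```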
  • Subjective Probability - is a type of probability derived from an individual's personal judgment or own experience about whether a specific outcome is likely to occur. It contains no formal calculations and reflects only the subject's opinions and past experience rather than data or computation.
  • Bayesian Probability - Updating a prior probability with the collected data to compute a posterior probability.
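A small worked sketch of a posterior probability using Bayes' rule, P(H | D) = P(D | H) P(H) / P(D). The scenario and every number below (a screening test with a 1% base rate, 95% sensitivity, and a 5% false-positive rate) are made up for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | D) for a positive result D, via Bayes' rule."""
    # P(D): total probability of observing a positive result.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical inputs, not real data.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"posterior probability: {p:.3f}")
```

Even with a fairly accurate test, the posterior stays well below the likelihood because the prior is so small.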
  • Random Experiment - Any process that can be repeated under similar conditions and whose outcomes cannot be predicted with certainty.
  • Event - A subset of the possible outcomes (the sample space) of an experiment.