ENGR 120 Quiz 7 (9a Statistics)

Cards (53)

  • The describe() method can compute several summary measures for all numerical data in a dataframe
  • The groupby() method allows you to group data together according to a particular variable
  • For inferential statistical tests, we'll use the SciPy library
  • Pearson's correlation is used to evaluate the linear relationship between two sets of continuous values
  • The r value from Pearson's correlation ranges from -1 to 1 and indicates the direction and strength of the correlation
  • The p value from Pearson's correlation indicates whether the result is statistically significant, with p < 0.05 being the standard threshold
  • Independent samples t-test is used to compare two unrelated datasets
  • Paired t-test is used to compare two related datasets from the same group
  • The t value from a t-test indicates the size of the difference between the two datasets relative to the variability in the data
  • Summary method that shows the number of observations for each unique value in a specific column.
    Syntax: movies_df['fandango'].value_counts()
  • Quantile method syntax: movies_df['fandango'].quantile([.25, .75]). This gives the 25th and 75th percentiles of the ratings in the fandango column.
  • Describe method: a single command to compute descriptive statistics for numerical data. Syntax: movies_df.describe()
  • Use the groupby() method to group the data according to the drug column, then apply the mean() method. This will average across the different time periods. Syntax: drugs_df.groupby('drug').mean(numeric_only = True)
  • Use the groupby() method to group the data according to the time_period column, once again using the mean() method. Syntax: drugs_df.groupby('time_period').mean(numeric_only = True)
  • Syntax for standard deviation: df_name['col_name'].std()
  • Number of unique values for a given column: df_name['col_name'].nunique()
  • Number of observations for a given column: df_name['col_name'].count()
  • To group by more than one column, provide a list: df_name.groupby(['col_1', 'col_2']).method_name(numeric_only = True)
  • Syntax to import stats module from SciPy: from scipy import stats
  • To use a function within stats, you do not need to include the word scipy. You only need to write stats.func_name()
  • Pearson correlation syntax: stats.pearsonr(dataframe['column1'], dataframe['column2'])
  • Read in a file with pandas: dataframe = pd.read_csv(file_path), where file_path is the path to the CSV file
  • Random selection of 5 rows: dataframe.sample(n = 5)
  • Teams with more than 300 respondents or fewer than 50 Independents
    nfl_df[(nfl_df.tot_respondents > 300) | (nfl_df.independent < 50)]
  • Percent Democrat: teams for which more than 25% (i.e., 0.25) of respondents are Democrats
    nfl_df[(nfl_df.democrat / nfl_df.tot_respondents) > 0.25]
  • Sort the dataframe by Independents, in descending order, such that the change to the dataframe is permanent.
    nfl_df.sort_values(by = 'independent', ascending = False, inplace = True)
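
Worked example (a minimal sketch, not from the course materials): the cards above describe the descriptive methods and the SciPy tests separately, so it may help to see them run together on one small dataframe. The dataframe demo_df and all of its values are invented for illustration. The deck only gives the syntax for stats.pearsonr; the t-test calls stats.ttest_ind (independent samples) and stats.ttest_rel (paired) are SciPy's functions for those tests and are assumed here to be the ones the course uses.

```python
import pandas as pd
from scipy import stats

# Hypothetical data, invented for illustration (not the course's drugs_df).
demo_df = pd.DataFrame({
    "drug": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "before": [10, 12, 14, 16, 11, 13, 15, 17],
    "after": [12, 15, 15, 19, 11, 14, 15, 18],
})

# Descriptive statistics
counts = demo_df["drug"].value_counts()              # observations per unique value
quartiles = demo_df["before"].quantile([.25, .75])   # 25th and 75th percentiles
summary = demo_df.describe()                         # summary measures for numeric columns
group_means = demo_df.groupby("drug").mean(numeric_only=True)  # mean per drug

# Pearson's correlation: linear relationship between two continuous columns
r, p_corr = stats.pearsonr(demo_df["before"], demo_df["after"])

# Independent samples t-test: two unrelated groups (drug A vs. drug B)
a_after = demo_df.loc[demo_df["drug"] == "A", "after"]
b_after = demo_df.loc[demo_df["drug"] == "B", "after"]
t_ind, p_ind = stats.ttest_ind(a_after, b_after)

# Paired t-test: two related measurements from the same rows (before vs. after)
t_rel, p_rel = stats.ttest_rel(demo_df["before"], demo_df["after"])

print(group_means)
print(f"Pearson r = {r:.3f}, p = {p_corr:.4f}")
print(f"Independent t = {t_ind:.3f}, p = {p_ind:.4f}")
print(f"Paired t = {t_rel:.3f}, p = {p_rel:.4f}")
```

Each test returns a statistic and a p value, which is why the results are unpacked into two names. With p < 0.05 as the threshold from the cards above, compare each printed p value against 0.05 to decide whether the result counts as statistically significant.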