Visualization-I L8

Cards (28)

  • We used numpy and pandas to read and manipulate data from a statistical and mathematical standpoint. In this lecture, you'll use the matplotlib and seaborn libraries to visualize your data, to get insights into your data that the statistics alone may not completely convey.
  • Univariate
    Single variables
  • Visualizations
    Help to understand how a single variable is distributed before considering complex interactions between multiple variables
  • Always keep the goals of your investigation in mind and explore the variables that will be key to answering your research questions.
  • Visualization
    Helps to observe any abnormalities in the data (outliers, missing values, …) which can point us to where we need to do more cleaning or perform further inspection
  • Plot types
    • Bar charts for qualitative variables
    • Histograms for quantitative variables
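  • Bar charts and histograms look similar (rectangles on a pair of axes), but the data type makes these plots distinct: bar charts show the levels of a qualitative variable, histograms show binned values of a quantitative variable.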
  • Bar chart
    Primary visualization choice to investigate the distribution of a qualitative variable
  • Bar chart
    • Each level of the categorical variable has a unique x position
    • The height of each bar shows how often each categorical value occurs (its frequency)
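    The two points above can be sketched with pandas and matplotlib. This is a minimal illustration, not the lecture's own code; the `hair_color` column and its values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: a single qualitative variable
df = pd.DataFrame(
    {"hair_color": ["brown", "black", "brown", "blonde", "brown", "black"]}
)

# Frequency of each level; value_counts sorts from most to least frequent
counts = df["hair_color"].value_counts()

fig, ax = plt.subplots()
# Each level gets its own x position; bar height is the frequency
ax.bar(list(counts.index), counts.values)
ax.set_xlabel("hair_color")
ax.set_ylabel("count")
```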
  • Nominal data
    Can be arranged in order of frequency (e.g. country, gender, race, hair color)
  • Ordinal data
    Should not be re-sorted by frequency, because the inherent order of the levels is more important to convey (e.g. "First", "Second", ...)
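    A small sketch of the two ordering rules, assuming made-up `country` (nominal) and `place` (ordinal) columns: the nominal counts are sorted by frequency, while the ordinal counts are put back into their inherent order before plotting.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one nominal and one ordinal variable
df = pd.DataFrame({
    "country": ["US", "FR", "US", "DE", "US", "FR"],
    "place": ["Second", "First", "Third", "Second", "Second", "Third"],
})

# Nominal: no inherent order, so sorting by frequency is fine
nominal_counts = df["country"].value_counts()

# Ordinal: keep the inherent order of the levels, whatever the frequencies are
place_order = ["First", "Second", "Third"]
ordinal_counts = df["place"].value_counts().reindex(place_order)

fig, (ax_nominal, ax_ordinal) = plt.subplots(1, 2, figsize=(8, 3))
ax_nominal.bar(list(nominal_counts.index), nominal_counts.values)
ax_ordinal.bar(list(ordinal_counts.index), ordinal_counts.values)
```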
  • Horizontal bar chart
    Convenient if you have a lot of categories or categories with long names
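    As an illustration, matplotlib's `barh` draws the same counts with the category labels on the y axis, where long names have room. The department names below are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categories with long names
departments = pd.Series(
    ["Software Engineering"] * 3
    + ["Data Science and Analytics"] * 2
    + ["Human Resources"]
)
counts = departments.value_counts()

fig, ax = plt.subplots()
# Bar length (width) is the frequency; labels sit on the y axis
ax.barh(list(counts.index), counts.values)
ax.invert_yaxis()  # put the most frequent category at the top
ax.set_xlabel("count")
```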
  • Relative frequency
    Shows the proportion of the data that falls in each category, instead of absolute frequency
  • Pie chart
    A univariate plot type for categorical variables that shows relative frequencies
  • Creating a pie chart in matplotlib
    1. Use the pie function
    2. Provide the data in summarized form
    3. Set startangle=90 to start the first slice vertically upwards
    4. Set counterclock=False to plot the sorted counts clockwise
    5. Call the axis function to make the scaling equal on both x and y axes
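    The five steps above might look like this in practice. The data is hypothetical, and `'square'` is one option for the axis call that gives equal scaling on both axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data, summarized into sorted counts (step 2)
colors = pd.Series(["brown", "black", "brown", "blonde", "brown", "black"])
sorted_counts = colors.value_counts()

fig, ax = plt.subplots()
ax.pie(
    sorted_counts,
    labels=list(sorted_counts.index),
    startangle=90,       # step 3: first slice starts vertically upwards
    counterclock=False,  # step 4: slices follow in clockwise order
)
ax.axis("square")  # step 5: equal scaling on the x and y axes
```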
  • Histogram
    Used to plot the distribution of a numeric variable. It's the quantitative version of the bar chart.
  • Creating a histogram in matplotlib
    1. Use the hist function
    2. Pass the data and the name of the numeric variable
    3. By default, the data is divided into 10 bins based on the range of values
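    A minimal sketch of these steps, using a made-up `speed` column filled with random values. The returned bin edges confirm the default of 10 bins.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric variable
rng = np.random.default_rng(0)
df = pd.DataFrame({"speed": rng.normal(60, 10, 500)})

fig, ax = plt.subplots()
# Pass the DataFrame as data and the column name as x
counts, edges, patches = ax.hist(data=df, x="speed")

# 10 bins by default means 11 bin edges spanning the data range
n_bins = len(edges) - 1
```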
  • The default of 10 bins is often not enough to understand the distribution of the data, so you may want to change the default settings.
  • Setting custom bin edges for a histogram
    1. Use numpy's arange function
    2. The first argument is the leftmost bin edge
    3. The second argument is the upper limit, which arange excludes from its output
    4. The third argument is the bin width
    5. Add the bin width to the max value (add 1 when the bin width is 1) so that the maximum data value falls inside the rightmost bin
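    The steps above can be sketched as follows. The values and the bin width of 5 are made up; the upper limit adds one full bin width to the maximum so that the largest value is always covered, whatever width is chosen.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical numeric data
values = np.array([3, 7, 12, 18, 21, 29, 32])

bin_width = 5
# arange(start, stop, step) excludes stop, so extend past the maximum
bins = np.arange(0, values.max() + bin_width, bin_width)

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(values, bins=bins)
```

    Rerunning this with a few different `bin_width` values is a quick way to compare how each choice represents the data.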
  • When creating histograms, it's useful to try different bin widths to see what represents the data best.
  • Seaborn's distplot
    Can also be used to plot a histogram, with built-in rules for choosing the histogram bins. It also plots a kernel density estimate (KDE) curve by default. Note that distplot has been deprecated since seaborn 0.11 in favor of histplot and displot.
  • Customizing the appearance of a histogram in distplot
    1. Set kde=False to plot just the histogram without the KDE curve
    2. Use hist_kws parameter to customize the appearance of the histogram (e.g. set alpha for transparency)
  • Figure
    The base of a visualization in matplotlib
  • Axes
    Within each Figure there are one or more Axes objects, each of which holds a plot
  • Creating a figure and axes explicitly
    1. Use plt.figure() to create a new Figure object
    2. Use fig.add_axes() to create a new Axes object in the Figure, specifying its position and size as [left, bottom, width, height] in fractions of the Figure
    3. Plot the visualization using the Axes object (e.g. ax.hist())
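    The three steps can be sketched like this. The rectangle passed to `add_axes` is an arbitrary choice for illustration, and the data is random.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical numeric data
rng = np.random.default_rng(0)
values = rng.normal(60, 10, 500)

# Step 1: create the Figure
fig = plt.figure()

# Step 2: add an Axes, given as [left, bottom, width, height]
# in fractions of the Figure's width and height
ax = fig.add_axes([0.125, 0.125, 0.775, 0.755])

# Step 3: plot through the Axes object
counts, edges, patches = ax.hist(values)
```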
  • Seaborn plot functions usually have an "ax" parameter to specify the Axes object to plot in.
  • In most cases, you can just use default matplotlib and seaborn functions, as they automatically follow the most recent Figure or Axes setup worked with.
  • The relative frequency of a particular observation or class interval is found by dividing its frequency (f) by the total number of observations (n).
    Relative frequency = frequency ÷ number of observations
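    As a worked example of the formula, assuming a small made-up series of hair colors, the proportions can be computed by hand and checked against pandas' built-in normalization.

```python
import pandas as pd

# Hypothetical categorical data
colors = pd.Series(["brown", "black", "brown", "blonde", "brown", "black"])

# Absolute frequency (f) of each category
counts = colors.value_counts()

# Total number of observations (n)
n = counts.sum()

# Relative frequency = f / n
rel_freq = counts / n

# pandas can produce the same proportions directly
rel_freq_builtin = colors.value_counts(normalize=True)
```

    Plotting `rel_freq` instead of `counts` in a bar chart gives the relative-frequency version of the plot.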