Visualization-I L8

Created by

miumiu

We used numpy and pandas to read and manipulate data from a statistical and mathematical standpoint. In this lecture, you'll use the matplotlib and seaborn libraries to visualize your data and gain insights that the statistics alone may not fully convey.
Univariate 
Single variables
Visualizations 
Help to understand how a single variable is distributed before considering complex interactions between multiple variables
You must always mind the goals of your investigation and explore the variables that will be key to answering your research questions.
Visualization
Helps to observe any abnormalities in the data (outliers, missing values, …) which can point us to where we need to do more cleaning or perform further inspection
Plot types
Bar charts for qualitative variables
Histograms for quantitative variables
Bar charts and histograms look very similar (rectangles on a pair of axes), but the type of variable they display makes them distinct plots.
Bar chart 
Primary visualization choice to investigate the distribution of a qualitative variable
Bar chart
Each level of the categorical variable has a unique x position
The height of each bar shows how frequently each categorical value occurs
Nominal data 
Can be arranged in order of frequency (e.g. country, gender, race, hair color)
Ordinal data 
Should not be re-sorted by frequency, because the inherent order of the levels is more important to convey (e.g. "First", "Second", ...)
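The two ordering rules above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the data, the column names `hair_color` and `place`, and the use of `value_counts` with `reindex` are all assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show the plot on screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "hair_color": ["brown", "black", "brown", "blond", "brown", "black"],
    "place": ["Second", "First", "Third", "Second", "Third", "Second"],
})

# Nominal variable: value_counts sorts by frequency (most frequent first).
hair_counts = df["hair_color"].value_counts()

# Ordinal variable: keep the inherent order of the levels, not the frequency order.
place_order = ["First", "Second", "Third"]
place_counts = df["place"].value_counts().reindex(place_order)

fig, (ax_nominal, ax_ordinal) = plt.subplots(1, 2, figsize=(8, 3))
ax_nominal.bar(list(hair_counts.index), hair_counts.values)
ax_nominal.set_title("Nominal: sorted by frequency")
ax_ordinal.bar(list(place_counts.index), place_counts.values)
ax_ordinal.set_title("Ordinal: inherent order")
```

In the ordinal panel, "Second" is the tallest bar but still sits between "First" and "Third".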
Horizontal bar chart 
Convenient if you have a lot of categories or categories with a long name
Relative frequency 
Shows the proportion of the data that falls in each category, instead of absolute frequency
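A horizontal bar chart of relative frequencies might look like the sketch below. The country values are made up, and computing the proportions with pandas' `value_counts(normalize=True)` is one assumed way to do it, not necessarily the lecture's.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show the plot on screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical observations with fairly long category names.
countries = pd.Series(
    ["Germany"] * 5 + ["United Kingdom"] * 2 + ["Netherlands"] * 1
)

# normalize=True returns proportions (count / total) instead of absolute counts.
proportions = countries.value_counts(normalize=True)

fig, ax = plt.subplots()
ax.barh(list(proportions.index), proportions.values)  # barh draws horizontal bars
ax.invert_yaxis()  # put the most frequent category at the top
ax.set_xlabel("Proportion of observations")
```

Because the category names run along the y axis, long labels stay readable without rotation.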
Pie chart
A univariate plot type for categorical variables that shows relative frequencies
Creating a pie chart in matplotlib
1. Use the pie function
2. Provide the data in summarized form
3. Set startangle=90 to start the first slice vertically upwards
4. Set counterclock=False to plot the sorted counts clockwise
5. Call the axis function (e.g. plt.axis('square')) to make the scaling equal on both x and y axes, so the pie is drawn as a circle
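The five steps above can be sketched as follows. The data are invented, and passing `"square"` to the axis function is an assumption, since the card does not say which argument to use.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show the plot on screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data, invented for illustration.
gender = pd.Series(["F", "M", "F", "F", "Other", "M"])

# Summarized form: one count per category, sorted from most to least frequent.
sorted_counts = gender.value_counts()

plt.figure()
plt.pie(
    sorted_counts,
    labels=sorted_counts.index,
    startangle=90,       # first slice starts vertically upwards
    counterclock=False,  # slices proceed clockwise in sorted order
)
plt.axis("square")  # equal scaling on x and y ("square" is an assumed choice)
ax = plt.gca()
```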
Histogram
Used to plot the distribution of a numeric variable. It's the quantitative version of the bar chart.
Creating a histogram in matplotlib
1. Use the hist function
2. Pass the data and the name of the numeric variable
3. By default, the data is divided into 10 bins based on the range of values
10 bins are usually not sufficient to understand the distribution of data, so we might want to change the default settings.
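As a minimal sketch of the default behavior, the example below plots randomly generated values and inspects the bin edges that `hist` returns. The DataFrame and the column name `speed` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show the plot on screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric variable, generated for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"speed": rng.normal(70, 15, size=500)})

plt.figure()
# Pass the data and the name of the numeric variable; bins are left at the default.
counts, edges, patches = plt.hist(data=df, x="speed")

# 10 bins means 11 edges, evenly spaced between the min and max of the data.
print(len(counts), len(edges))
```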
Setting custom bin edges for a histogram 
1. Use numpy's arange function
2. The first argument is the leftmost bin edge
3. The second argument is the upper limit
4. The third argument is the bin width
5. Add the bin width (1 when the bins are one unit wide) to the max value: arange excludes its upper limit, so this makes the rightmost bin edge reach the maximum data value
When creating histograms, it's useful to try different bin widths to see what represents the data best.
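The bin-edge recipe above can be sketched like this. The values and the bin width of 5 are made up for the example; the point is that `np.arange` excludes its stop value, so the upper limit is pushed past the maximum by one bin width.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show the plot on screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data, invented for illustration.
values = np.array([3, 7, 8, 12, 15, 15, 21, 24, 30])

bin_width = 5
# arange(start, stop, step) excludes stop, so extend the upper limit
# by one bin width to make sure the maximum value (30) is covered.
bin_edges = np.arange(0, values.max() + bin_width, bin_width)

plt.figure()
counts, edges, patches = plt.hist(values, bins=bin_edges)

# For comparison: without the extension, the last edge would be 25
# and the observation at 30 would fall outside every bin.
too_short = np.arange(0, values.max(), bin_width)
```

Changing `bin_width` and re-running is a quick way to compare how different widths represent the same data.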
Seaborn's distplot
Can also be used to plot a histogram, with built-in rules for specifying histogram bins. It also plots a kernel density estimate (KDE) curve by default. Note that distplot is deprecated as of seaborn 0.11, which recommends histplot or displot instead.
Customizing the appearance of a histogram in distplot
1. Set kde=False to plot just the histogram without the KDE curve
2. Use hist_kws parameter to customize the appearance of the histogram (e.g. set alpha for transparency)
Figure
The base of a visualization in matplotlib
Axes 
Within each Figure there are one or more Axes objects, each containing a plot
Creating a figure and axes explicitly
1. Use plt.figure() to create a new Figure object
2. Use fig.add_axes() to create a new Axes object in the Figure, specifying the dimensions
3. Plot the visualization using the Axes object (e.g. ax.hist())
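The three steps above can be sketched as follows. The data are randomly generated, and the dimensions passed to `add_axes` are arbitrary example values given as `[left, bottom, width, height]` in fractions of the figure size.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show the plot on screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical numeric data, generated for illustration.
rng = np.random.default_rng(2)
values = rng.normal(0, 1, size=200)

# 1. Create a new Figure object.
fig = plt.figure()

# 2. Add an Axes to it: [left, bottom, width, height] as figure fractions.
ax = fig.add_axes([0.125, 0.125, 0.775, 0.755])

# 3. Plot using the Axes object rather than the pyplot-level function.
ax.hist(values)
```

Holding on to `fig` and `ax` makes it explicit where each plot lands, which helps once a figure contains several Axes.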
Seaborn plot functions usually have an "ax" parameter to specify the Axes object to plot in.
In most cases, you can just call the default matplotlib and seaborn functions, as they automatically plot in the most recently used Figure or Axes.
The relative frequency of a particular observation or class interval is found by dividing its frequency (f) by the total number of observations (n).
Relative frequency = frequency ÷ number of observations