Exploratory data analysis (EDA)
The critical process of performing INITIAL INVESTIGATION on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations
Data value
is a piece of INFORMATION, such as a number or a date
Data variable
is a characteristic that you can MEASURE, such as weight or income
Distribution
The __ of a dataset is how the dataset is SPREAD OUT. You can visualize a dataset's distribution by observing its shape on a graph
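A minimal sketch of looking at a distribution without a plotting library: values are counted into fixed-width bins and printed as a text histogram. The delay values, the bin width, and the helper name `bin_counts` are made up for illustration and are not taken from any real dataset.

```python
from collections import Counter

# Hypothetical delays in minutes (illustrative values only).
delays = [2, 5, 7, 8, 12, 14, 15, 21, 23, 48]
BIN_WIDTH = 10  # assumed bin width, chosen only for this example

def bin_counts(values, bin_width):
    """Map each bin's starting value to the number of values falling in it."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

bins = bin_counts(delays, BIN_WIDTH)

# One row per bin; the length of the bar shows how the data is spread out.
for start, count in bins.items():
    print(f"{start:>3} to {start + BIN_WIDTH - 1:<3} {'#' * count}")
```

Most values fall in the lowest bins and a single value sits far to the right, which is the kind of shape a histogram makes visible.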
Outlier
is a data value that is SIGNIFICANTLY DIFFERENT (for example, much higher or much lower) from the rest of a dataset
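One common way to flag outliers is the 1.5 x IQR rule. The definition above does not name a specific rule, so the rule itself, the multiplier `k`, the helper name `iqr_outliers`, and the sample values are all assumptions made for this sketch.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical delays in minutes; 180 is far above the rest.
delays = [2, 5, 7, 8, 12, 14, 15, 21, 23, 180]
outliers = iqr_outliers(delays)
print(outliers)
```

Other rules (for example, a z-score threshold) can flag different points, so the choice of rule is part of the analysis rather than a fixed definition.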
Data model
is a method of ORGANIZING data and the relationships between values in a dataset
The hflights dataset includes data on all flights that departed Houston, TX in 2011
Categorical data
Data that fits into categories (e.g. Gender, Country)
Quantitative data
NUMERICAL DATA that represents a measured or counted quantity (e.g. age, sales, population)
Converting to Factors
1. Factor variables are categorical variables that can be either numeric or string variables
2. Convert Origin, DayOfWeek, Month to factors
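The idea behind a factor can be sketched in plain Python: a factor keeps the set of distinct levels plus, for each observation, an integer code pointing at its level. The helper `to_factor` is hypothetical and only imitates the concept; the sample values below are illustrative and are not read from the hflights data.

```python
def to_factor(values):
    """Return (levels, codes): sorted distinct levels and 1-based codes."""
    levels = sorted(set(values))
    index = {level: i + 1 for i, level in enumerate(levels)}
    codes = [index[v] for v in values]
    return levels, codes

# A string column and a numeric column, both treated as categorical.
origin = ["IAH", "HOU", "IAH", "IAH", "HOU"]  # illustrative values
month = [1, 1, 2, 12]                         # illustrative values

origin_levels, origin_codes = to_factor(origin)
month_levels, month_codes = to_factor(month)

print(origin_levels, origin_codes)
print(month_levels, month_codes)
```

Once a numeric column such as Month is treated this way, its values act as labels for groups rather than as quantities to average.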
Univariate analysis
Analysis of a SINGLE VARIABLE; it describes that variable on its own and does not examine cause-effect relationships
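A small sketch of univariate analysis using summary statistics from Python's standard library. The variable name `arr_delay` and its values are invented for the example.

```python
import statistics

# Hypothetical arrival delays in minutes (negative means early).
arr_delay = [4, -3, 12, 0, 25, 7, -8, 15]

# Describe the single variable on its own: size, range, center, spread.
summary = {
    "n": len(arr_delay),
    "min": min(arr_delay),
    "max": max(arr_delay),
    "mean": statistics.mean(arr_delay),
    "median": statistics.median(arr_delay),
    "stdev": statistics.stdev(arr_delay),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```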
Bivariate analysis
ANALYSIS OF TWO VARIABLES to determine the relationship between them, often with one variable treated as dependent and the other as independent
Types of bivariate data analysis
1. Numerical and Numerical
2. Categorical and Categorical
3. Numerical and Categorical
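The three pairings above can each be sketched with one typical technique: a Pearson correlation for two numerical variables, a table of pair counts for two categorical variables, and per-group means for a numerical variable split by a categorical one. These techniques are common choices rather than the only ones, and all variable names and values here are made up for illustration.

```python
import math
from collections import Counter, defaultdict
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical rows for four flights.
distance = [100, 200, 300, 400]     # numerical
air_time = [30, 50, 70, 90]         # numerical
origin = ["IAH", "HOU", "IAH", "HOU"]  # categorical
carrier = ["AA", "AA", "WN", "WN"]     # categorical
delay = [10, 20, 30, 40]            # numerical

# 1. Numerical and numerical: strength of the linear relationship.
r = pearson(distance, air_time)

# 2. Categorical and categorical: how often each pair occurs together.
crosstab = Counter(zip(origin, carrier))

# 3. Numerical and categorical: compare the number across groups.
groups = defaultdict(list)
for o, d in zip(origin, delay):
    groups[o].append(d)
group_means = {o: mean(ds) for o, ds in groups.items()}

print(r, dict(crosstab), group_means)
```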
Factor variables
are categorical variables that can be either numeric or string