FIT1043

  • Data Science is currently in a growth phase, with increasing demand for skilled data science practitioners in industry, academia, and government
  • Data Science
    The extraction of knowledge from data, which is a continuation of the fields of data mining and predictive analytics
  • Narrow Data Science
    Machine learning on big data
  • Broad Data Science
    Extraction of knowledge/value from data through complete data lifecycle process
  • Big Data
    A broad term for data sets so large or complex that traditional data processing applications are inadequate
  • The main intention of Data Science is to learn from data and gain insight from it
  • Data Science examples
    • Google's spell checker
    • Google's translation engine
    • Amazon's recommendation engine
  • Difference between Science and Data Science
    Science is more focused on theory and coming up with theories to generalize, while Data Science has less focus on theory, uses models to get insights from data, and is more focused on predictive capabilities rather than explaining phenomena
  • Hacking and math skills
    Needed for machine learning
  • Domain knowledge and math skills
    Needed for traditional research: PhD researchers spend time acquiring expertise in specific areas
  • Hacking and expertise
    Danger zone: can produce what appears to be a legitimate analysis without any understanding of how the result was reached or how to interpret it
  • Machine Learning
    Concerned with the development of algorithms and techniques that allow computers to learn (building computers, computational output, statistics is the underlying theory)
  • Reasons to use Machine Learning
    • Human expertise is not available (Mars exploration)
    • Many solutions need to be adapted automatically (user personalisation)
    • Humans are expensive to use for the work (handwritten zip code recognition)
    • Situation changes over time (junk mail)
    • Large amounts of data (discover astronomical objects)
  • Data Science Process
    • Pitching ideas
    • Collecting data
    • Integration
    • Interpretation
    • Governance
    • Engineering
    • Wrangling
    • Modelling
    • Visualisation
    • Operationalization
  • Data Scientist
    Addresses the data science process to extract meaning/value from data (middle row: collect to present)
  • Chief Data Scientist
    A form of chief scientist who addresses data management, data engineering, and data science goals (all)
  • Relationship of Data Science to Other Disciplines
    • Data engineering: building scalable systems for storing and processing data; deals with storage and computational resources across the full data science lifecycle; data accessibility is the ultimate goal
    • Data analyst: performing analysis and understanding results
    • Data management: managing data through its life cycle
  • Python Basic Types
    Integers, floats, booleans, strings
  • Python is a dynamically typed language, so variables do not need to be declared with a type before use
  • Python Built-in Functions
    • Maths, conversions, iterators, cases, attributes, etc.
  • Python Operators and String Manipulation
    • Arithmetic operators: - + * / %
    • Comparison operators: > < <= >= != ==
    • String operators: + * in
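A few of the operators above in use. This is only an illustrative sketch; the variable names and values are made up.

```python
remainder = 7 % 3                     # modulo: remainder of the division
quotient = 7 / 2                      # / always yields a float in Python 3
is_equal = (2 + 2 == 4)               # comparison operators return booleans
greeting = "data" + " " + "science"   # + concatenates strings
repeated = "ab" * 3                   # * repeats a string
has_sub = "sci" in greeting           # in tests for a substring
```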
  • Python Data Types
    • Lists: ordered, mutable, comma-separated items in square brackets
    • Tuples: ordered, immutable
    • Dictionary: key-value pairs
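A minimal sketch of the three container types; the names and values are illustrative only.

```python
# List: comma-separated items in square brackets; can be changed in place.
ages = [12, 15, 9]
ages.append(20)

# Tuple: like a list, but immutable once created.
point = (3, 4)
try:
    point[0] = 10          # attempting to modify a tuple raises TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Dictionary: maps keys to values.
student = {"name": "Ana", "age": 12}
student["age"] = 13        # update the value stored under an existing key
```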
  • Python Control Structures
    • Conditions: if elif else
    • Iterations: while, for
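The control structures can be sketched as follows. The function `classify` and the sample values are hypothetical, chosen only to show the syntax.

```python
def classify(n):
    """Return a label for n using if / elif / else."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

# for: iterate over the items of a sequence.
labels = []
for value in [-2, 0, 5]:
    labels.append(classify(value))

# while: repeat as long as the condition holds.
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1
```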
  • Python Libraries for Data Science
    • Numpy: scientific computing, support for multidimensional arrays
    • Pandas: data structures as well as operations for manipulating numerical tables
    • Matplotlib: library for visualization
    • Scikit-learn: python machine learning library that provides the tools for data mining and data analysis
  • Reading from CSV files in Python
    1. import pandas as pd
    2. data = pd.read_csv("filename.csv")
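A self-contained sketch of the same two steps. To avoid depending on a real file, an in-memory string stands in for the contents of "filename.csv"; the sample rows are made up.

```python
import io
import pandas as pd

# Stand-in for the contents of "filename.csv" (made-up sample rows).
csv_text = "Name,Age\nAna,12\nBen,15\nCho,9\n"

# read_csv accepts a path or any file-like object; the first row becomes the header.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # (rows, columns)
print(list(data.columns))  # column names taken from the header row
```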
  • Obtaining data summary in Python
    Minimum, maximum, median, 1st quartile, 3rd quartile
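In pandas, one common way to get these figures is `describe()`. The sketch below assumes a small made-up `Age` column.

```python
import pandas as pd

df = pd.DataFrame({"Age": [9, 12, 15, 20, 44]})  # made-up values

summary = df["Age"].describe()
# describe() reports count, mean, std, min, 25%, 50% (median), 75%, max
print(summary)
```

The quartiles are labelled `25%`, `50%`, and `75%` in the result, so `summary["50%"]` is the median.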
  • Working with dataframes in Python
    1. Select column using column name
    2. Select multiple columns using list of column names
    3. Select a value using column name and row index
    4. Select particular row
    5. Select all rows with a particular value in one of the columns
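The five selections above might look like this in pandas. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cho"],
    "Age": [12, 15, 9],
    "Class": ["1st", "2nd", "1st"],
})

ages = df["Age"]                        # 1. one column (a Series)
subset = df[["Name", "Age"]]            # 2. several columns (a DataFrame)
bens_age = df.loc[1, "Age"]             # 3. one value by row label and column name
second_row = df.loc[1]                  # 4. one row by label (df.iloc[1] selects by position)
first_class = df[df["Class"] == "1st"]  # 5. all rows matching a value
```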
  • Saving data in Python
    1. df2 = df.loc[df['Age'] > 12]
    2. df2.to_csv("output.csv", index=False, header=True)
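A runnable sketch of filtering and saving, using a made-up DataFrame and a temporary directory so that no file is left behind. `index=False` is used here to omit the row labels from the output.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Name": ["Ana", "Ben", "Cho"], "Age": [12, 15, 9]})

df2 = df.loc[df["Age"] > 12]     # keep rows where Age is strictly greater than 12

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.csv")
    df2.to_csv(path, index=False, header=True)   # header=True writes the column names
    with open(path) as f:
        saved_text = f.read()
```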
  • Categorical Data
    Has a specific value from a limited set of values, options are fixed, we create our own categories (e.g. 1st class, 2nd class)
  • Aggregation and groupby in Python
    1. Input → split → apply → combine
    2. Advanced aggregation: run multiple aggregation operators at once, write custom aggregators using anonymous functions
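Split, apply, combine can be sketched with `groupby`. The data, the column names, and the output names `youngest`, `oldest`, and `age_range` are all assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["1st", "2nd", "1st", "2nd"],
    "Age": [10, 20, 30, 40],
})

# Split rows by Class, apply mean to each group's Age, combine into one Series.
mean_age = df.groupby("Class")["Age"].mean()

# Several aggregations at once, including a custom one written as a lambda.
stats = df.groupby("Class").agg(
    youngest=("Age", "min"),
    oldest=("Age", "max"),
    age_range=("Age", lambda s: s.max() - s.min()),
)
```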
  • Types of Data Visualisation
    • Categorical Data: bar graphs, pie charts
    • Numeric data: histogram, box plots
    • Frequency tables
    • Bar charts
    • Pie charts
    • Histogram
    • Motion chart: interactive multi-dimensional data visualisation
    • Box Plots
    • Scatter plots
  • Motion Chart
    Interactive multi-dimensional data visualisation that allows visualizing data in 5 dimensions: x-axis, y-axis, size of bubble, color of bubble, time
    • Advantages: time dimension allows deeper insights and observing trends, good for exploratory work, appeals to the brain at a more instinctive, intuitive level
    • Disadvantages: not suited for static media, display can be overwhelming, controls are complex
  • Measures of Centrality
    • Mean: average
    • Mode: most frequently occurring
    • Median: the middle value after sorting the samples
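The three measures can be computed with Python's standard `statistics` module. The sample values are made up.

```python
import statistics

samples = [2, 3, 3, 5, 12]   # made-up values

mean = statistics.mean(samples)      # sum divided by count: 25 / 5
mode = statistics.mode(samples)      # most frequent value
median = statistics.median(samples)  # middle value of the sorted samples
```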
  • Symmetric distribution
    Mean and median are (nearly) the same
  • Positively skewed data
    Mean is greater than the median
  • Negatively skewed data
    Mean is less than the median
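A small numeric check of the three cases. The lists are made up, and `mean_minus_median` is a hypothetical helper written for this sketch, not a standard skewness measure.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 40]     # long tail towards large values
left_skewed = [-40, 2, 3, 4, 5]     # long tail towards small values

def mean_minus_median(values):
    """Typically positive for a right tail, negative for a left tail, near 0 if symmetric."""
    return statistics.mean(values) - statistics.median(values)

for name, values in [("symmetric", symmetric),
                     ("positive skew", right_skewed),
                     ("negative skew", left_skewed)]:
    print(name, mean_minus_median(values))
```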
  • Measures of Spread
    • Range = max - min
    • Variance: arithmetic mean of the squared deviations from the sample mean
    • Standard deviation: square root of the variance
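A worked sketch of the three measures on made-up values. It uses the population forms from the `statistics` module, which divide by n; `statistics.variance` and `statistics.stdev` divide by n - 1 instead.

```python
import statistics

samples = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values, mean = 5

value_range = max(samples) - min(samples)
variance = statistics.pvariance(samples)   # mean of squared deviations from the mean
std_dev = statistics.pstdev(samples)       # square root of the variance
```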
  • Pearson correlation
    Measures linear association between two continuous variables
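The coefficient can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations. The helper `pearson` and the sample data are written for this sketch; in practice a library routine would usually be used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors cancel between numerator and denominator, so sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]               # exactly linear in hours
r_up = pearson(hours, scores)           # perfect positive linear association
r_down = pearson(hours, scores[::-1])   # perfect negative linear association
```

The result always lies between -1 and 1; values near 0 indicate little linear association.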
  • Association between categorical and numeric values
    Use a side-by-side boxplot if x is categorical and y is numeric
  • Association between two categorical variables
    Use a side-by-side bar graph