week 4

Cards (13)

  • Data quality issues
    • Misspelling and inconsistency
      • Inconsistency
        • Common cases
          • Upper case vs. lower case
          • Inconsistent representation of domain values (e.g. 0/"no" vs. 1/"yes")
        • Detecting and fixing (see the sketch after this card)
          • Investigate unique domain values (unique())
          • Make the representation consistent (replace())
      • Misspelling
        • Investigate unique domain values (unique())
        • String matching
          • Calculate domain value frequencies (value_counts())
          • For all values, find matches for the infrequent values
          • Replace infrequent values with the best match (if one exists) from the more frequent values
    • Irregularities
      • Invalid dates
      • Domain-dependent values: a value that is not valid for a specific domain (e.g. a negative number of passengers)
    • Integrity constraint violations
      • Sold date vs. advertised date
      • One field is the sum of the other two
      • Land size must be greater than building size
    • Duplications
      • Complete duplication
      • Duplication due to a missing field (different records hold different pieces of information)
    • Missing values
      • Imputation
        • Mean and mode
        • Regression (find variables that are closely related)
        • Dummy value
      • Removal
    • Outliers
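A minimal pandas sketch of the inconsistency workflow above (investigate unique values, then make the representation consistent). The column names and the yes/no encoding are made up for illustration:

    import pandas as pd

    # Hypothetical data with case inconsistency and a mixed yes/no encoding.
    df = pd.DataFrame({
        "suburb": ["Richmond", "richmond", "RICHMOND", "Carlton"],
        "heating": ["yes", "Yes", 1, 0],
    })

    # Investigate unique domain values to spot the inconsistencies.
    print(df["suburb"].unique())
    print(df["heating"].unique())

    # Make the representation consistent: normalise case ...
    df["suburb"] = df["suburb"].str.title()
    # ... and map every encoding of yes/no onto a single representation.
    df["heating"] = df["heating"].replace({"yes": 1, "Yes": 1, "no": 0, "No": 0})
    print(df)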
  • Detecting data quality issues (see the sketch after this list)
    1. Investigate unique domain values (unique())
    2. Investigate value ranges for the column
    3. Type casting (e.g. parse date strings into datetime objects to catch exceptions, using pandas.to_datetime)
    4. Highly dependent on the domain and the problem
    5. Identifying keys to check for duplicates (try different keys)
    6. Investigate unique domain values
    7. Investigate the value range; be cautious about extremely small and large values
    8. Domain analysis
    9. Range of values using df.describe()
    10. Graphical tools (e.g. boxplot)
    11. 3σ (three-sigma) edit rule
    12. It is good to compare the results found by different identifiers
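A minimal pandas sketch of the detection steps above (type casting, value ranges, and the 3σ edit rule). The file name and column names are hypothetical:

    import pandas as pd

    df = pd.read_csv("listings.csv")  # hypothetical file

    # Type casting: parsing date strings surfaces invalid dates as NaT.
    dates = pd.to_datetime(df["sold_date"], errors="coerce")
    print("invalid dates:", dates.isna().sum())

    # Value ranges: describe() flags suspicious minima and maxima.
    print(df["price"].describe())

    # 3σ edit rule: flag values more than three standard deviations from the mean.
    mean, std = df["price"].mean(), df["price"].std()
    print(df[(df["price"] - mean).abs() > 3 * std])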
  • Fixing data quality issues (see the sketch after this list)
    1. Replace
    2. Remove
    3. Swap
    4. Combining information/merge
    5. Remove duplicates
    6. Imputation
    7. Mean and mode
    8. Regression (find variables that are closely related)
    9. Dummy value
    10. Removal
    11. Similar to handling missing values
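A minimal pandas sketch of the merge and swap fixes above. The id and size columns are hypothetical, and the rule that land size should exceed building size comes from the integrity constraint noted earlier:

    import pandas as pd

    df = pd.DataFrame({
        "id":            [1, 1, 2],
        "land_size":     [620.0, None, 90.0],
        "building_size": [None, 150.0, 300.0],
    })

    # Combine information / merge: partial duplicates of the same id are collapsed
    # by taking the first non-null value per column, which also removes duplicates.
    df = df.groupby("id", as_index=False).first()

    # Swap: if the two sizes were entered in the wrong columns, swap them back.
    swapped = df["land_size"] < df["building_size"]
    df.loc[swapped, ["land_size", "building_size"]] = (
        df.loc[swapped, ["building_size", "land_size"]].values
    )
    print(df)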
  • Handling missing values (see the sketch after this list)
    • Imputation
      • Mean and mode
      • Regression (find variables that are closely related)
      • Dummy value
    • Removal
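A minimal pandas sketch of the imputation options above (mode, mean, regression on a closely related variable, plus the removal and dummy-value alternatives). The columns and values are made up:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "rooms":     [2, 3, 3, 4, np.nan],
        "land_size": [150.0, 210.0, np.nan, 320.0, 260.0],
        "price":     [400_000, 550_000, 500_000, np.nan, 560_000],
    })

    # Mode imputation (typical for discrete or categorical columns).
    df["rooms"] = df["rooms"].fillna(df["rooms"].mode()[0])

    # Mean imputation for a numeric column.
    df["land_size"] = df["land_size"].fillna(df["land_size"].mean())

    # Regression imputation: predict the missing value from a closely related
    # variable, fitting a least-squares line on the complete rows.
    known = df.dropna(subset=["price"])
    slope, intercept = np.polyfit(known["rooms"], known["price"], deg=1)
    df["price"] = df["price"].fillna(intercept + slope * df["rooms"])

    # Alternatives: fill with a dummy value, or drop the incomplete rows.
    # df["price"] = df["price"].fillna(-1)
    # df = df.dropna()
    print(df)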
  • Handling outliers
    • Similar to handling missing values
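One way to make "similar to handling missing values" concrete: mask the detected outliers or invalid values to NaN, then reuse the missing-value machinery. The passenger counts and the valid range below are invented:

    import pandas as pd

    passengers = pd.Series([2, 4, 1, -3, 120, 3])

    # Treat outliers and invalid values like missing values: mask them out ...
    passengers = passengers.mask((passengers < 0) | (passengers > 10))

    # ... then impute (or remove) them exactly as for missing values.
    passengers = passengers.fillna(passengers.median())
    print(passengers)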
  • Integrity constraint violation is a common data quality issue that is highly dependent on context
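A sketch of how such context-dependent integrity constraints can be checked in pandas; the constraints and column names follow the examples in these notes, but the file is hypothetical:

    import pandas as pd

    df = pd.read_csv("listings.csv", parse_dates=["advertised_date", "sold_date"])

    # A property cannot be sold before it was advertised.
    bad_dates = df[df["sold_date"] < df["advertised_date"]]

    # Land size must be greater than building size.
    bad_sizes = df[df["land_size"] <= df["building_size"]]

    # One field should be the sum of the other two (e.g. total rooms).
    bad_sums = df[df["rooms"] != df["bedrooms"] + df["bathrooms"]]

    print(len(bad_dates), len(bad_sizes), len(bad_sums))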
  • Duplications can be due to complete duplication, or to duplication with a missing field (different records hold different pieces of information)
  • Outliers are not easy to find in data
  • Data quality issues
    • Misspelling and inconsistency
      • Inconsistency
        • Common cases
          • Upper case vs. lower case
          • Inconsistent representation of domain values (e.g. 0/"no" vs. 1/"yes")
        • Detecting and fixing
          • Investigate unique domain values (unique())
          • Make the representation consistent (replace())
      • Misspelling
        • Investigate unique domain values (unique())
        • String matching (see the sketch after this card)
          • Calculate domain value frequencies (value_counts())
          • For all values, find matches for the infrequent values
          • Replace infrequent values with the best match (if one exists) from the more frequent values
    • Irregularities
      • Invalid dates
      • Domain-dependent values: a value that is not valid for a specific domain (e.g. a negative number of passengers)
    • Integrity constraint violations
      • Sold date vs. advertised date
      • One field is the sum of the other two
      • Land size must be greater than building size
    • Duplications
      • Complete duplication
      • Duplication due to a missing field (different records hold different pieces of information)
    • Missing values
    • Extremely small and large values
    • Outliers
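A minimal sketch of the string-matching fix for misspellings described in this card, using value_counts() for the frequencies and difflib for the matching; the suburb values and the 0.8 cutoff are made up:

    import difflib
    import pandas as pd

    suburbs = pd.Series(["Richmond", "Richmond", "Richmnod", "Carlton", "Carlton", "Carltn"])

    # Frequent values are assumed to be correct spellings;
    # infrequent values are candidate misspellings.
    counts = suburbs.value_counts()
    frequent = counts[counts > 1].index.tolist()
    infrequent = counts[counts == 1].index.tolist()

    # For each infrequent value, find the best match among the frequent values
    # and replace it only if a sufficiently close match exists.
    fixes = {}
    for value in infrequent:
        match = difflib.get_close_matches(value, frequent, n=1, cutoff=0.8)
        if match:
            fixes[value] = match[0]

    suburbs = suburbs.replace(fixes)
    print(suburbs.value_counts())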
  • Detecting data quality issues (see the sketch after this list)
    1. Investigate unique domain values (unique())
    2. Investigate value ranges for the column
    3. Type casting (e.g. parse date strings into datetime objects to catch exceptions, using pandas.to_datetime)
    4. Highly dependent on the domain and the problem
    5. Identifying keys to check for duplicates (try different keys)
    6. Investigate unique domain values
    7. Investigate the value range; be cautious about extremely small and large values
    8. Domain analysis
    9. Range of values using df.describe()
    10. Graphical tools (e.g. boxplot)
    11. 3σ (three-sigma) edit rule
    12. It is good to compare the results found by different identifiers
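A small sketch of steps 5 and 12 above: try different candidate keys for duplicate detection and compare what each one finds. The file and key columns are hypothetical:

    import pandas as pd

    df = pd.read_csv("listings.csv")  # hypothetical file

    # Try different candidate keys and compare how many duplicates each identifies.
    for key in [["address"], ["address", "sold_date"], ["lat", "lon"]]:
        n_dupes = df.duplicated(subset=key, keep=False).sum()
        print(key, "->", n_dupes, "potentially duplicated rows")

    # Complete duplicates (identical on every column) can simply be dropped.
    df = df.drop_duplicates()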
  • Fixing data quality issues (see the sketch after this list)
    1. Refer to documentation if it exists to see if there's special meaning
    2. Replace
    3. Remove
    4. Swap
    5. Combining information/merge
    6. Remove duplicates
    7. Imputation
    8. Mean and mode
    9. Regression (find variables that are closely related)
    10. Dummy value
    11. Removal
    12. Similar to handling missing values
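A small sketch of step 1 and the dummy-value idea above: if the documentation says certain sentinel values mean "unknown", replace them with NaN so they are handled like missing values. The column name and sentinel values are invented:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("listings.csv")  # hypothetical file

    # Documentation (hypothetically) says -1 and 9999 are placeholders for "unknown".
    df["year_built"] = df["year_built"].replace({-1: np.nan, 9999: np.nan})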
  • It all depends on the situation and needs justification
  • Challenge: Outliers are not easy to find
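One way to tackle this challenge, following the graphical-tools hint above: visualise the column with a boxplot, and apply the same quartile logic numerically. The file and column are hypothetical:

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("listings.csv")  # hypothetical file

    # A boxplot makes extremely small and large values visible at a glance.
    df.boxplot(column="price")
    plt.show()

    # The same idea numerically: flag values beyond 1.5 * IQR from the quartiles.
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    print(df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)])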