week 4

Cards (13)

  • Data quality issues
    • Misspelling and inconsistency
      • Inconsistency
        • Common cases
          • Upper case vs. lower case
          • Inconsistent representation of domain values (e.g. 0/"no" vs. 1/"yes")
        • Detecting and fixing (see the sketch after this card)
          • Investigate unique domain values (unique())
          • Make the representation consistent (replace())
      • Misspelling
        • Investigate unique domain values (unique())
        • String matching
          • Calculate domain value frequencies (value_counts())
          • For all values, find matches for the infrequent values
          • Replace infrequent values with the best match (if one exists) from the more frequent values
    • Irregularities
      • Invalid dates
      • Domain-dependent values: a value that is not valid for a specific domain (e.g. a negative number of passengers)
    • Integrity constraint violations
      • Sold date vs. advertised date
      • One field is the sum of the other two
      • Land size must be greater than building size
    • Duplications
      • Complete duplication
      • Duplication due to a missing field (different records hold different pieces of information)
    • Missing values
      • Imputation
        • Mean and mode
        • Regression (find variables that are closely related)
        • Dummy value
      • Removal
    • Outliers
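A minimal pandas sketch of the inconsistency workflow above (investigate unique values, then make the representation consistent). The column names and the yes/no encoding are made up for illustration:

    import pandas as pd

    # Hypothetical data with case inconsistency and a mixed yes/no encoding.
    df = pd.DataFrame({
        "suburb": ["Richmond", "richmond", "RICHMOND", "Carlton"],
        "heating": ["yes", "Yes", 1, 0],
    })

    # Investigate unique domain values to spot the inconsistencies.
    print(df["suburb"].unique())
    print(df["heating"].unique())

    # Make the representation consistent: normalise case ...
    df["suburb"] = df["suburb"].str.title()
    # ... and map every encoding of yes/no onto a single representation.
    df["heating"] = df["heating"].replace({"yes": 1, "Yes": 1, "no": 0, "No": 0})
    print(df)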
  • Detecting data quality issues (see the sketch after this list)
    1. Investigate unique domain values (unique())
    2. Investigate value ranges for the column
    3. Type casting (e.g. parse date strings into datetime objects to catch exceptions, using pandas.to_datetime)
    4. Highly dependent on the domain and the problem
    5. Identifying keys to check for duplicates (try different keys)
    6. Investigate unique domain values
    7. Investigate the value range; be cautious about extremely small and large values
    8. Domain analysis
    9. Range of values using df.describe()
    10. Graphical tools (e.g. boxplot)
    11. 3σ (three-sigma) edit rule
    12. It is good to compare the results found by different identifiers
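A minimal pandas sketch of the detection steps above (type casting, value ranges, and the 3σ edit rule). The file name and column names are hypothetical:

    import pandas as pd

    df = pd.read_csv("listings.csv")  # hypothetical file

    # Type casting: parsing date strings surfaces invalid dates as NaT.
    dates = pd.to_datetime(df["sold_date"], errors="coerce")
    print("invalid dates:", dates.isna().sum())

    # Value ranges: describe() flags suspicious minima and maxima.
    print(df["price"].describe())

    # 3σ edit rule: flag values more than three standard deviations from the mean.
    mean, std = df["price"].mean(), df["price"].std()
    print(df[(df["price"] - mean).abs() > 3 * std])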
  • Fixing data quality issues (see the sketch after this list)
    1. Replace
    2. Remove
    3. Swap
    4. Combining information/merge
    5. Remove duplicates
    6. Imputation
    7. Mean and mode
    8. Regression (find variables that are closely related)
    9. Dummy value
    10. Removal
    11. Similar to handling missing values
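A minimal pandas sketch of the merge and swap fixes above. The id and size columns are hypothetical, and the rule that land size should exceed building size comes from the integrity constraint noted earlier:

    import pandas as pd

    df = pd.DataFrame({
        "id":            [1, 1, 2],
        "land_size":     [620.0, None, 90.0],
        "building_size": [None, 150.0, 300.0],
    })

    # Combine information / merge: partial duplicates of the same id are collapsed
    # by taking the first non-null value per column, which also removes duplicates.
    df = df.groupby("id", as_index=False).first()

    # Swap: if the two sizes were entered in the wrong columns, swap them back.
    swapped = df["land_size"] < df["building_size"]
    df.loc[swapped, ["land_size", "building_size"]] = (
        df.loc[swapped, ["building_size", "land_size"]].values
    )
    print(df)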
  • Handling missing values (see the sketch after this list)
    • Imputation
      • Mean and mode
      • Regression (find variables that are closely related)
      • Dummy value
    • Removal
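A minimal pandas sketch of the imputation options above (mode, mean, regression on a closely related variable, plus the removal and dummy-value alternatives). The columns and values are made up:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "rooms":     [2, 3, 3, 4, np.nan],
        "land_size": [150.0, 210.0, np.nan, 320.0, 260.0],
        "price":     [400_000, 550_000, 500_000, np.nan, 560_000],
    })

    # Mode imputation (typical for discrete or categorical columns).
    df["rooms"] = df["rooms"].fillna(df["rooms"].mode()[0])

    # Mean imputation for a numeric column.
    df["land_size"] = df["land_size"].fillna(df["land_size"].mean())

    # Regression imputation: predict the missing value from a closely related
    # variable, fitting a least-squares line on the complete rows.
    known = df.dropna(subset=["price"])
    slope, intercept = np.polyfit(known["rooms"], known["price"], deg=1)
    df["price"] = df["price"].fillna(intercept + slope * df["rooms"])

    # Alternatives: fill with a dummy value, or drop the incomplete rows.
    # df["price"] = df["price"].fillna(-1)
    # df = df.dropna()
    print(df)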
  • Handling outliers
    • Similar to handling missing values
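One way to make "similar to handling missing values" concrete: mask the detected outliers or invalid values to NaN, then reuse the missing-value machinery. The passenger counts and the valid range below are invented:

    import pandas as pd

    passengers = pd.Series([2, 4, 1, -3, 120, 3])

    # Treat outliers and invalid values like missing values: mask them out ...
    passengers = passengers.mask((passengers < 0) | (passengers > 10))

    # ... then impute (or remove) them exactly as for missing values.
    passengers = passengers.fillna(passengers.median())
    print(passengers)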
  • Integrity constraint violation is a common data quality issue that is highly dependent on context
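A sketch of how such context-dependent integrity constraints can be checked in pandas; the constraints and column names follow the examples in these notes, but the file is hypothetical:

    import pandas as pd

    df = pd.read_csv("listings.csv", parse_dates=["advertised_date", "sold_date"])

    # A property cannot be sold before it was advertised.
    bad_dates = df[df["sold_date"] < df["advertised_date"]]

    # Land size must be greater than building size.
    bad_sizes = df[df["land_size"] <= df["building_size"]]

    # One field should be the sum of the other two (e.g. total rooms).
    bad_sums = df[df["rooms"] != df["bedrooms"] + df["bathrooms"]]

    print(len(bad_dates), len(bad_sizes), len(bad_sums))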
  • Duplications can be due to complete duplication, or to duplication with a missing field (different records hold different pieces of information)
  • Outliers are not easy to find in data
  • Data quality issues
    • Misspelling and inconsistency
      • Inconsistency
        • Common cases
          • Upper case vs. lower case
          • Inconsistent representation of domain values (e.g. 0/"no" vs. 1/"yes")
        • Detecting and fixing
          • Investigate unique domain values (unique())
          • Make the representation consistent (replace())
      • Misspelling
        • Investigate unique domain values (unique())
        • String matching (see the sketch after this card)
          • Calculate domain value frequencies (value_counts())
          • For all values, find matches for the infrequent values
          • Replace infrequent values with the best match (if one exists) from the more frequent values
    • Irregularities
      • Invalid dates
      • Domain-dependent values: a value that is not valid for a specific domain (e.g. a negative number of passengers)
    • Integrity constraint violations
      • Sold date vs. advertised date
      • One field is the sum of the other two
      • Land size must be greater than building size
    • Duplications
      • Complete duplication
      • Duplication due to a missing field (different records hold different pieces of information)
    • Missing values
    • Extremely small and large values
    • Outliers
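A minimal sketch of the string-matching fix for misspellings described in this card, using value_counts() for the frequencies and difflib for the matching; the suburb values and the 0.8 cutoff are made up:

    import difflib
    import pandas as pd

    suburbs = pd.Series(["Richmond", "Richmond", "Richmnod", "Carlton", "Carlton", "Carltn"])

    # Frequent values are assumed to be correct spellings;
    # infrequent values are candidate misspellings.
    counts = suburbs.value_counts()
    frequent = counts[counts > 1].index.tolist()
    infrequent = counts[counts == 1].index.tolist()

    # For each infrequent value, find the best match among the frequent values
    # and replace it only if a sufficiently close match exists.
    fixes = {}
    for value in infrequent:
        match = difflib.get_close_matches(value, frequent, n=1, cutoff=0.8)
        if match:
            fixes[value] = match[0]

    suburbs = suburbs.replace(fixes)
    print(suburbs.value_counts())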
  • Detecting data quality issues (see the sketch after this list)
    1. Investigate unique domain values (unique())
    2. Investigate value ranges for the column
    3. Type casting (e.g. parse date strings into datetime objects to catch exceptions, using pandas.to_datetime)
    4. Highly dependent on the domain and the problem
    5. Identifying keys to check for duplicates (try different keys)
    6. Investigate unique domain values
    7. Investigate the value range; be cautious about extremely small and large values
    8. Domain analysis
    9. Range of values using df.describe()
    10. Graphical tools (e.g. boxplot)
    11. 3σ (three-sigma) edit rule
    12. It is good to compare the results found by different identifiers
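A small sketch of steps 5 and 12 above: try different candidate keys for duplicate detection and compare what each one finds. The file and key columns are hypothetical:

    import pandas as pd

    df = pd.read_csv("listings.csv")  # hypothetical file

    # Try different candidate keys and compare how many duplicates each identifies.
    for key in [["address"], ["address", "sold_date"], ["lat", "lon"]]:
        n_dupes = df.duplicated(subset=key, keep=False).sum()
        print(key, "->", n_dupes, "potentially duplicated rows")

    # Complete duplicates (identical on every column) can simply be dropped.
    df = df.drop_duplicates()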
  • Fixing data quality issues (see the sketch after this list)
    1. Refer to documentation if it exists to see if there's special meaning
    2. Replace
    3. Remove
    4. Swap
    5. Combining information/merge
    6. Remove duplicates
    7. Imputation
    8. Mean and mode
    9. Regression (find variables that are closely related)
    10. Dummy value
    11. Removal
    12. Similar to handling missing values
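A small sketch of step 1 and the dummy-value idea above: if the documentation says certain sentinel values mean "unknown", replace them with NaN so they are handled like missing values. The column name and sentinel values are invented:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("listings.csv")  # hypothetical file

    # Documentation (hypothetically) says -1 and 9999 are placeholders for "unknown".
    df["year_built"] = df["year_built"].replace({-1: np.nan, 9999: np.nan})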
  • It all depends on the situation and needs justification
  • Challenge: Outliers are not easy to find
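One way to tackle this challenge, following the graphical-tools hint above: visualise the column with a boxplot, and apply the same quartile logic numerically. The file and column are hypothetical:

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("listings.csv")  # hypothetical file

    # A boxplot makes extremely small and large values visible at a glance.
    df.boxplot(column="price")
    plt.show()

    # The same idea numerically: flag values beyond 1.5 * IQR from the quartiles.
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    print(df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)])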