data mining 2

Cards (50)

  • It is true that an indiscriminate addition of low-quality data and input features might introduce too much noise and, at the same time, considerably slow down the training algorithm.
  • Dimensionality reduction

    Techniques that reduce the number of input features, attributes, variables, or dimensions in a dataset while preserving important information
  • High-dimensional data

    • The performance of the model deteriorates as the number of features increases
    • The complexity of the model increases with the number of features
    • It becomes more difficult to find a good solution
    • It can also lead to overfitting
  • Dimensionality reduction techniques

    • Remove redundant features and reduce the complexity of a model
    • Improve the performance of a machine learning algorithm
    • Make it easier to visualize the data
    • Data compression, and hence reduced storage space
  • Dimensionality reduction methods

    • Feature selection
    • Feature extraction
  • Feature selection

    Selecting a subset of the original features that are most relevant to the problem at hand
  • Feature selection methods
    • Filter methods
    • Wrapper methods
    • Embedded methods
  • Feature extraction

    Creating new features by combining or transforming the original features to capture the essence of the original data in a lower-dimensional space
  • A good solution would be a subset of predictive variables that are strongly correlated with the target variable (relevance) and weakly correlated with each other, ideally pairwise orthogonal (low redundancy)
  • Advantages of feature selection

    • Eliminate variables that have nothing to do with the problem being addressed (relevance)
    • Eliminate variables that are duplicates, i.e. provide the same type of information (redundancy)
    • Facilitate the interpretation of results (better identify the impact of variables on the result)
    • Facilitate the deployment of models: fewer variables mean less information to collect when applying the model
    • Robustness. Principle of parsimony: for identical performance, the simpler model (needing less training data) will be more robust on the population
  • Feature selection techniques

    • Missing Values Ratio
    • Low Variance Filter
    • High Correlation Filter
    • Backward Feature Elimination
    • Forward Feature Construction
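
A minimal sketch of the first three filter techniques above, using pandas and scikit-learn; the file name and the thresholds are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("data.csv")  # hypothetical input file

# Missing Values Ratio: drop columns with too many missing values
missing_ratio = df.isna().mean()
df = df.loc[:, missing_ratio < 0.4]  # 40% cutoff is an illustrative choice

# Low Variance Filter: drop near-constant numeric columns
num = df.select_dtypes("number").fillna(df.median(numeric_only=True))
keep = VarianceThreshold(threshold=0.01).fit(num).get_support()
df = num.loc[:, keep]  # the sketch keeps only numeric columns from here on

# High Correlation Filter: drop one variable of each highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])
```

Backward feature elimination and forward feature construction are wrapper methods; scikit-learn's RFE and SequentialFeatureSelector implement the same ideas.
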
  • Measure the relationship (correlation) between qualitative variables

    The correlation ratio varies between 0 and 1, calculated from: Joint and marginal frequencies, Entropy (~ standard deviation [dispersion]), Mutual information (~ covariance [relationship])
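
A minimal sketch of these quantities for two categorical pandas Series; normalizing the mutual information by min(H(X), H(Y)) is one common way to land in [0, 1]:

```python
import numpy as np
import pandas as pd

def entropy(p: np.ndarray) -> float:
    """Shannon entropy of a probability vector (0 * log 0 treated as 0)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def normalized_mutual_info(x: pd.Series, y: pd.Series) -> float:
    """Mutual information I(X;Y) scaled to [0, 1] by min(H(X), H(Y))."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()  # joint frequencies
    px, py = joint.sum(axis=1), joint.sum(axis=0)         # marginal frequencies
    hx, hy = entropy(px), entropy(py)
    mi = hx + hy - entropy(joint.ravel())                 # I = H(X) + H(Y) - H(X,Y)
    return mi / min(hx, hy) if min(hx, hy) > 0 else 0.0
```
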
  • Measure the relationship (correlation) between quantitative predictors and qualitative target variable

    The correlation ratio varies between 0 and 1, calculated from: Residual dispersion, Dispersion explained by group membership, Total dispersion, Conditional mean
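
A minimal sketch of this correlation ratio (eta squared): the dispersion explained by group membership, computed from the conditional means, divided by the total dispersion:

```python
import pandas as pd

def correlation_ratio(x: pd.Series, groups: pd.Series) -> float:
    """Eta squared: dispersion explained by group membership / total dispersion."""
    grand_mean = x.mean()
    total = ((x - grand_mean) ** 2).sum()  # total dispersion
    explained = sum(                       # dispersion of the conditional means
        len(xg) * (xg.mean() - grand_mean) ** 2
        for _, xg in x.groupby(groups)
    )
    return explained / total if total > 0 else 0.0
```
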
  • Is this ranking and selection method able to remove these variables and choose the best predictive variables here?
  • Feature extraction techniques

    • Principal component analysis (PCA)
    • Linear discriminant analysis (LDA)
    • Generalized Discriminant Analysis (GDA)
    • Singular value decomposition (SVD)
    • Neural Autoencoder (NA)
    • t-distributed stochastic neighbor embedding (t-SNE)
  • PCA
    Projects the data onto a lower-dimensional space by finding the directions of maximum variance
  • PCA tends to find linear correlations between variables, which is sometimes undesirable.
  • PCA
    1. Construct the covariance matrix of the data
    2. Compute the eigenvectors of this matrix
    3. Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data
  • The retained eigenvectors should capture the most important variance, even though some information may be lost in the process.
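
A minimal NumPy sketch of these three steps, assuming X is a samples-by-features array:

```python
import numpy as np

def pca(X: np.ndarray, n_components: int) -> np.ndarray:
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # 1. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # 2. eigenvectors (eigh: symmetric matrix)
    order = np.argsort(eigvals)[::-1]       # 3. sort by decreasing eigenvalue
    return Xc @ eigvecs[:, order[:n_components]]  # project onto the top components
```

Up to the sign of each component, this matches the projection computed by sklearn.decomposition.PCA.
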
  • Dimensionality reduction can be used as a preprocessing step before applying machine learning algorithms
  • Linear Discriminant Analysis (LDA)

    Finds a succession of linear combinations of the initial variables (called latent or discriminant variables, which are orthogonal to each other) that best distinguish the groups
  • The objective is to find the coefficients of the discriminant variable (or factorial axis) z that maximize the correlation ratio η²(z) = (dispersion of z explained by group membership) / (total dispersion of z)
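
In practice, scikit-learn's LinearDiscriminantAnalysis finds these discriminant axes; a minimal sketch on the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)  # at most n_classes - 1 axes
Z = lda.fit_transform(X, y)                       # the discriminant variables z
print(lda.explained_variance_ratio_)              # between-group dispersion per axis
```
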
  • Python modules, packages, libraries, and platforms for data science and machine learning
  • Pandas
    A widely used data science library that allows you to: Read data in different formats, Simplify data manipulations, Aggregate and merge data easily, Perform statistical calculations, Name columns and rows for better readability
  • NumPy
    A fundamental library for: Performing numerical calculations with Python, Creating arrays of different dimensions, Treating arrays that store values of the same data type and facilitating the execution of mathematical and logical operations on the arrays
  • Read data

    CSV and text files, Microsoft Excel, SQL databases
  • Simplify data manipulations

    Missing values, columns, etc.
  • Aggregate and merge data

    The groupby, agg, and merge functions
  • Perform statistical calculations

    Mean, median, variance, sum
  • Naming columns and rows makes projects more readable
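
A minimal pandas sketch tying the capabilities above together; the file names and column names are hypothetical:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")             # read data from a CSV file
sales["amount"] = sales["amount"].fillna(0)  # handle missing values

# Aggregate: mean and total amount per region
summary = sales.groupby("region", as_index=False).agg(
    mean_amount=("amount", "mean"),
    total_amount=("amount", "sum"),
)

# Merge with another table on a shared key
regions = pd.read_csv("regions.csv")
enriched = summary.merge(regions, on="region")

# Statistical calculations and readable labels
print(sales["amount"].median(), sales["amount"].var())
enriched = enriched.rename(columns={"total_amount": "Total sales"})
```
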
  • NumPy
    Fundamental library for performing numerical calculations, creating arrays of different dimensions, handling arrays that store values of the same data type, and providing a wide range of functionality (linear algebra, Fourier transforms, random number generation, etc.)
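
A minimal NumPy sketch of these capabilities:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2-D array holding a single dtype (float64)
print(a.mean(), a.T @ a)                # vectorized math and matrix operations
print(np.linalg.inv(a))                 # linear algebra
rng = np.random.default_rng(seed=0)     # random number generation
print(rng.normal(size=3))
print(np.fft.fft([1, 0, -1, 0]))        # Fourier transform
```
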
  • Matplotlib
    Library for creating plots in one line of code to visualize data
  • Seaborn
    Extension of Matplotlib and a more powerful tool for plotting relational graphs between two vectors and distribution plots, visualizing statistical models, and designing dashboards
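
A minimal sketch of both libraries, using the tips example dataset that seaborn fetches on first use:

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")                      # sample dataset (downloaded once)
sns.scatterplot(data=tips, x="total_bill", y="tip")  # relational plot of two vectors
plt.show()

sns.histplot(tips["total_bill"], kde=True)           # distribution plot
plt.show()
```
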
  • Plotly
    Web-based data visualization tool that offers many useful ready-to-use charts
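
A minimal sketch with Plotly Express and its bundled iris dataset:

```python
import plotly.express as px

df = px.data.iris()  # sample dataset bundled with Plotly
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()           # renders an interactive chart in the browser or notebook
```
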
  • Scikit-learn (sklearn)

    The most popular machine learning module; used to prepare data, choose and configure machine learning algorithms, and handle standard machine learning and data mining tasks
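
A minimal scikit-learn sketch that also illustrates the earlier card on dimensionality reduction as a preprocessing step:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepare the data, reduce dimensionality, then fit a classifier
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```
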
  • TensorFlow
    Popular Python framework for machine learning and deep learning; used to work with artificial neural networks
  • Keras

    Ideal framework for implementing complex deep learning models; more flexible and easier to pick up than TensorFlow
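
A minimal Keras sketch of a small neural network built on TensorFlow; the layer sizes are illustrative assumptions:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),                     # 20 input features (assumption)
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```
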
  • SciPy
    Useful library with modules for linear algebra, integration, optimization, and statistics
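
A minimal SciPy sketch of the optimization and statistics modules:

```python
from scipy import optimize, stats

res = optimize.minimize_scalar(lambda x: (x - 2) ** 2)  # minimum of a 1-D function
print(res.x)                                            # ~2.0

t, p = stats.ttest_ind([1.1, 2.0, 1.8], [2.4, 2.9, 3.1])  # two-sample t-test
print(p)
```
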
  • Statsmodels
    Library that provides classes and functions for running statistical tests and exploring statistical data
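
A minimal statsmodels sketch fitting an ordinary least squares regression on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)  # synthetic linear relationship

X = sm.add_constant(x)              # add an intercept column
result = sm.OLS(y, X).fit()         # ordinary least squares
print(result.summary())             # coefficients, t-tests, R-squared, etc.
```
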
  • Requests
    Performs the first step of web scraping: downloading the HTML code of a page
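
A minimal Requests sketch of that first scraping step; the URL is a placeholder:

```python
import requests

resp = requests.get("https://example.com")  # download the page's HTML
resp.raise_for_status()                     # fail loudly on HTTP errors
html = resp.text                            # ready for parsing, e.g. with BeautifulSoup
```
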