Part 5b : Structure learning in Bayesian Networks

Cards (26)

  • Need methods to learn the structure and the parameters of the model from example observations when the underlying (in)dependencies are not known
  • The learning task :
    given :
    • set of random variables
    • training set D (i.i.d.)
    find :
    • a graph structure
    • parameter values such that the resulting model is a good representation of the data
  • General approaches to learning graphical models
    1. Constraint-based structure learning : decide (in-)dependencies between variables from the training data -> construct a network that captures them
    2. Score-based structure learning : define a quality score F over models -> search for the network that maximises it
    3. Bayesian model averaging
  • Score-based methods for structure learning need a scoring function F and a search algorithm.
    The model should reflect the true structure of the problem domain and avoid overfitting
  • To find the maximum-likelihood ⟨G, θ_G⟩ pair, we search for the graph structure G that achieves the highest likelihood when we use the ML parameters for G
  • The maximum likelihood score
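A minimal sketch (not from the cards) of the maximum likelihood score for discrete data: the log-likelihood of D under the ML (relative-frequency) parameters of a candidate structure. The `parents` dict representation and the function name are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def ml_score(data, parents):
    """Log-likelihood of the data under the ML (relative-frequency) parameters
    for the structure given by `parents`.
    data    : (N, n) integer array of discrete observations
    parents : dict mapping variable index -> tuple of parent indices (the graph G)"""
    n_vars = data.shape[1]
    score = 0.0
    for i in range(n_vars):
        pa = list(parents.get(i, ()))
        joint = Counter(tuple(row[pa + [i]]) for row in data)  # counts N[x_i, pa_i]
        marg = Counter(tuple(row[pa]) for row in data)         # counts N[pa_i]
        for cfg, n_joint in joint.items():
            score += n_joint * np.log(n_joint / marg[cfg[:-1]])
    return score
```

Note that adding an edge can never decrease this score, which is why the maximum likelihood network tends towards full connectivity (next card).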
  • Problems with the maximum likelihood score :
    • the maximum likelihood network will be fully connected
    • this leads the search algorithm to overfit the data
  • Least-squares regression
    basic idea :
    • describe the relation between the features x and the target y via some function
    we want to find parameters that minimise the mean squared error (MSE) between predicted and true values on the training set
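A small illustration (toy data and variable names are my own): the parameters minimising the MSE on the training set can be found in closed form, here via `np.linalg.lstsq`.

```python
import numpy as np

# toy data: y depends roughly linearly on x (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(50)

# design matrix with a bias column; lstsq returns the MSE-minimising theta
X = np.column_stack([x, np.ones_like(x)])
theta = np.linalg.lstsq(X, y, rcond=None)[0]

mse = np.mean((X @ theta - y) ** 2)
print("theta:", theta, "training MSE:", mse)
```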
  • Under- vs Overfitting
    A) underfits
    B) high bias
    C) overfits
    D) high variance
  • underfitting consequence : large error on the training set, but possibly also on new data
    overfitting consequence : poor generalisation to new cases
  • Controlling overfitting via regularisation : permit a large class of possibly complex models, but punish complex models by putting a penalty on the size of the parameters.
  • Controlling overfitting via regularisation
  • The General Approach : penalise all parameters for large values
    A) regularisation term
    B) regularisation parameter
  • Why does punishing large parameter values lead to simpler models :
    • the model can afford only very few large parameter values
    • but the learning algorithm can choose which parameters to keep large
    • dimensions without a big benefit w.r.t. the error will be suppressed
  • Finding the parameters with the highest posterior probability means minimising a weighted sum of the mean squared error and the sum of the squares of the parameter values
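A minimal sketch of this L2-regularised (ridge) objective; the helper name and toy data are assumptions for illustration. Minimising the squared error plus λ·||θ||² has a closed form, and increasing λ shrinks the parameters, suppressing dimensions that do not help much with the error.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise ||X @ theta - y||^2 + lam * ||theta||^2;
    closed form: theta = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# toy data: only the 1st and 5th feature actually matter
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
y = X @ np.array([3.0, 0.0, 0.0, 0.0, 1.0]) + 0.1 * rng.standard_normal(40)
for lam in (0.0, 1.0, 10.0):
    print(lam, np.round(ridge_fit(X, y, lam), 2))  # coefficients shrink as lam grows
```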
  • Regularisation denotes methods that try to prevent overfitting, typically by introducing some penalty for model complexity into the objective function to be optimised
  • Bayes Information Criterion (BIC)
    A) Dirichlet prior
    B) log-likelihood
    C) size
    D) independent
  • BIC score interpretation 1 : finding a good model == maximising the BIC; the first term measures the fit to the data, the second term reflects the complexity
  • BIC interpretation 2 :
    • negated BIC score
    • first term : interpreted as the number of bits needed to encode the model
    • second term : number of bits needed to encode the training data, given the model
    • good model == minimising the sum
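A sketch of the BIC computation, under the same assumed representation as in the earlier sketch (`cards` lists the variable cardinalities, `parents` maps each variable to its parent tuple; the log-likelihood could come from the ml_score sketch above).

```python
import numpy as np

def dim_g(cards, parents):
    """Number of independent parameters of a discrete network:
    sum over i of (|Val(X_i)| - 1) * product of the parent cardinalities."""
    return sum((cards[i] - 1) * int(np.prod([cards[p] for p in parents.get(i, ())]))
               for i in range(len(cards)))

def bic_score(loglik, n_samples, cards, parents):
    """BIC(G : D) = loglik(theta_hat_G : D) - (log N / 2) * Dim(G)."""
    return loglik - 0.5 * np.log(n_samples) * dim_g(cards, parents)
```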
  • Problem with score-based structure learning as a search problem : the search space is enormous
  • 2 possible approaches to search algorithms :
    1. Complete search in a strongly restricted space
    2. Heuristic search in the full space
  • Complete search : for each Xi, determine its parents Pa(Xi) among the earlier variables in a fixed ordering so as to produce the optimal improvement according to the score
    extremely high bias (the result depends strongly on the chosen variable ordering)
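A sketch of such a restricted complete search, under the assumptions that a variable ordering is given, the number of parents per variable is bounded, and `family_score(i, pa)` is a hypothetical callback returning the decomposable score contribution of one family.

```python
from itertools import combinations

def best_parents_given_ordering(order, family_score, max_parents=2):
    """For each X_i, enumerate all parent sets among the *earlier* variables
    in the ordering (up to max_parents) and keep the best-scoring one."""
    parents = {}
    for pos, i in enumerate(order):
        earlier = order[:pos]
        candidates = [pa for k in range(min(max_parents, len(earlier)) + 1)
                      for pa in combinations(earlier, k)]
        parents[i] = max(candidates, key=lambda pa: family_score(i, pa))
    return parents
```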
  • Heuristic search :
    • resulting graph structure must remain acyclic
    • each operator application includes an acyclicity check on the graph
    ! no guarantee that we will find anything close to an optimal solution !
  • Possible heuristic search algorithms :
    • Greedy local search algorithms
    • randomised local search algorithms
  • Greedy search:
    Start : some initial, randomly selected structure and score it
    Iterate :
    • Generate all neighbours of the current structure
    • apply all possible legal search operations
    • evaluate all of these via the scoring function
    • select the best of them
    • make that the new current state
    until : none of the new structures produces an improvement
    will get stuck in a local maximum of the scoring function
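A sketch of this loop; `neighbours` and `score` are assumed callbacks (neighbours should generate only legal, acyclic structures; score could e.g. be the BIC).

```python
def greedy_search(initial_structure, neighbours, score):
    """Greedy local search: repeatedly move to the best-scoring legal
    neighbour; stop when no neighbour improves the score (local maximum)."""
    current = initial_structure
    current_score = score(current)
    while True:
        scored = [(score(g), g) for g in neighbours(current)]
        if not scored:
            break
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= current_score:
            break  # stuck in a local maximum of the scoring function
        current, current_score = best, best_score
    return current, current_score
```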
  • Greedy local search is expensive
    improve by:
    1. exploit the decomposability of the scoring function
    2. cache components of the scoring function that were already computed
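A small sketch of point 2: because the score decomposes into a sum of per-family terms, neighbouring structures share almost all families, so their contributions can be cached (the `family_score` callback is again a hypothetical per-family scorer).

```python
def cached_family_score(cache, i, pa, family_score):
    """Look up / compute the score contribution of one family (X_i, Pa(X_i));
    reusing cached terms avoids rescoring unchanged families after a single
    edge addition, deletion or reversal."""
    key = (i, tuple(sorted(pa)))
    if key not in cache:
        cache[key] = family_score(i, pa)
    return cache[key]
```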