We need methods that learn the structure and the parameters of the model from example observations when the underlying (in)dependencies are not known.
The learning task :
given :
set of random variables
training set D (i.i.d.)
find :
a graph structure
parameter values such that the resulting model is a good representation of the data
General approaches to learning graphical models :
Constraint-based structure learning : decide (in-)dependencies between variables from the training data -> construct the network
Score-based structure learning : define a quality score F over models -> search for the network that maximises F
Bayesian model averaging
Score-based methods for structure learning need a scoring function F and a search algorithm.
model should reflect the true structure of the problem domain + avoid overfitting
To find the maximum-likelihood ⟨G, θ_G⟩ pair, we search for the graph structure G that achieves the highest likelihood when we use the ML parameters for G
The maximum likelihood score
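In the usual notation (assumed here; θ̂_G denotes the ML parameters for structure G):

score_ML(G : D) = max_θ log P(D | θ, G) = log P(D | θ̂_G, G)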
Problems with the maximum likelihood score :
the maximum-likelihood network will be fully connected, since adding an edge can never decrease the maximum achievable likelihood (see the decomposition below)
this leads the search algorithm to overfit the training data
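One way to see why (standard decomposition for discrete Bayesian networks; notation assumed here): the likelihood score can be rewritten as

score_ML(G : D) = M · Σ_i I_P̂( X_i ; Pa_G(X_i) ) − M · Σ_i H_P̂( X_i )

where I_P̂ and H_P̂ denote mutual information and entropy under the empirical distribution P̂ of the training set. The entropy term does not depend on G, and empirical mutual information can only increase when parents are added, so denser graphs never score worse and the fully connected graph always attains the maximum.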
Least-squares regression
basic idea :
describe the relation between the features x and the target y via some function
want to find parameters that minimise the mean squared error (MSE) between predicted and true values on the training set
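Written out (notation assumed here: f(x; θ) is the model's prediction, N the number of training examples):

MSE(θ) = (1/N) · Σ_{i=1}^{N} ( f(x_i ; θ) − y_i )² ,   θ̂ = argmin_θ MSE(θ)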
Under- vs Overfitting
(Figure: a model that is too simple underfits the data, i.e. has high bias; a model that is too complex overfits, i.e. has high variance.)
underfitting consequence : large error on the training set, but possibly also on new data
overfitting consequence : poor generalisation to new cases
Controlling overfitting via regularisation : permit a large class of possibly complex models, but punish complex models by putting a penalty on the size of the parameters.
Controlling overfitting via regularisation
The general approach : penalise all parameters for large values, e.g. minimise
E(θ) = MSE(θ) + λ · Σ_j θ_j²
where λ · Σ_j θ_j² is the regularisation term and λ ≥ 0 the regularisation parameter.
Why does punishing large parameter values lead to simpler models :
the model can afford only very few large parameter values
but the learning algorithm can choose which parameters to keep large
dimensions without a big benefit w.r.t. the error will be suppressed
Finding the parameters with the highest posterior probability means minimising a weighted sum of the mean squared error and the sum of the squares of the parameter values
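A minimal numpy sketch of this idea (not from the notes; this is plain ridge regression for a linear model, all names illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return the theta minimising ||X @ theta - y||^2 + lam * ||theta||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# lam = 0 recovers ordinary least squares; increasing lam shrinks all parameters
# towards zero, i.e. the fitted model becomes "simpler".
```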
Regularisation denotes methods that try to prevent overfitting, typically by introducing a penalty for model complexity into the objective function to be optimised
Bayes Information Criterion (BIC)
score_BIC(G : D) = log P(D | θ̂_G, G) − (log M / 2) · Dim[G]
where the first term is the log-likelihood of the data under the ML parameters of G, M is the size of the training set D, and Dim[G] is the number of independent parameters of G; the BIC arises as an asymptotic approximation of the Bayesian score with a Dirichlet parameter prior.
BIC score, interpretation 1 : finding a good model == maximising the BIC score; the first term measures the fit to the data, the second term reflects the complexity of the model
BIC interpretation 2 (minimum description length) :
consider the negated BIC score
first term : interpreted as the number of bits needed to encode the model
second term : number of bits needed to encode the training data, given the model
good model == minimising the sum
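Rough illustration (numbers invented purely for this example): with M = 1000 training cases, giving a previously parentless binary variable a single binary parent raises its number of independent parameters from 1 to 2, so the complexity penalty grows by (log 1000)/2 ≈ 3.5 (natural logarithm); under the BIC the extra edge is only worth it if it improves the log-likelihood by more than that.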
Problem with score-based structure learning as a search problem : the search space is enormous (the number of possible DAGs grows super-exponentially with the number of variables)
2 possible approaches to search algorithms :
Complete search in a strongly restricted space
Heuristic search in the full space
Complete search : for each Xi, determine its parents Pa(Xi) among the earlier variables (w.r.t. a fixed ordering) so as to produce the optimal improvement according to the score (see the sketch below)
extremely high bias
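One way to make this concrete (a sketch, not taken from the notes: it assumes a fixed ordering of the variables and a per-family term family_score(Xi, Pa) of a decomposable score such as the BIC; both names are illustrative):

```python
from itertools import combinations

def best_parents_given_ordering(order, family_score, max_parents=None):
    """For each variable, exhaustively pick the best-scoring parent set among its predecessors."""
    parents = {}
    for i, x in enumerate(order):
        predecessors = order[:i]
        k_max = len(predecessors) if max_parents is None else min(max_parents, len(predecessors))
        candidates = [frozenset(c) for k in range(k_max + 1)
                      for c in combinations(predecessors, k)]
        parents[x] = max(candidates, key=lambda ps: family_score(x, ps))
    return parents
```

The restriction to a fixed ordering (and optionally a maximum number of parents) is what makes exhaustive search feasible, and it is also the source of the high bias mentioned above.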
Heuristic search :
resulting graph structure must remain acyclic
each operator application includes an acyclicity check on the graph
! no guarantee that we will find anything close to an optimal solution !
Possible heuristic search algorithms :
Greedy local search algorithms
randomised local search algorithms
Greedy search:
Start : some initial, randomly selected structure and score it
Iterate :
Generate all neighbours of the current structure
apply all possible legal search operations
evaluate all these via the scoring function
select the best of them
make that the new current state
until : none of the new structures produces an improvement
will get stuck in a local maximum of the scoring function
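A minimal Python sketch of this loop (illustrative only; nodes is the list of variables, score is assumed to be any function mapping an edge set to a real-valued score, e.g. a BIC score computed on the training data):

```python
import itertools

def is_acyclic(nodes, edges):
    """Check acyclicity by repeatedly removing nodes that have no incoming edge."""
    remaining = set(nodes)
    while remaining:
        removable = [n for n in remaining
                     if not any(v == n and u in remaining for (u, v) in edges)]
        if not removable:
            return False                     # every remaining node has an incoming edge -> cycle
        remaining -= set(removable)
    return True

def neighbours(nodes, edges):
    """All structures reachable by one legal operation: add, delete or reverse a single edge."""
    for u, v in itertools.permutations(nodes, 2):
        if (u, v) not in edges and (v, u) not in edges:
            yield edges | {(u, v)}           # add edge u -> v
    for (u, v) in edges:
        yield edges - {(u, v)}               # delete edge u -> v
        yield (edges - {(u, v)}) | {(v, u)}  # reverse edge u -> v

def greedy_search(nodes, score, start_edges=frozenset()):
    """Hill climbing: move to the best-scoring acyclic neighbour until nothing improves."""
    current = frozenset(start_edges)
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(nodes, current):
            if not is_acyclic(nodes, candidate):   # acyclicity check on every operator application
                continue
            s = score(candidate)
            if s > current_score:                  # remember the best improvement of this pass
                best, current_score, improved = candidate, s, True
        if improved:
            current = best
    return current, current_score
```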
Greedy local search is expensive
improve by:
exploit decomposability of the scoring function
cache components of the scoring function that have already been computed
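Concretely, with a decomposable score (such as the BIC) the total score splits into per-family terms, score(G : D) = Σ_i FamScore(Xi, Pa_G(Xi) : D) (FamScore is just a name used here). A single edge operation changes the parent sets of at most two variables, so only those family terms have to be re-evaluated, and family terms computed for earlier candidates can be looked up in a cache instead of being recomputed.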