Big Data

Created by

Holly Southall

Cards (20)

Big data is a catch-all term to describe vast amounts of data. It has three defining features, the 'three Vs': volume, velocity, and variety
Volume refers to the quantity of data. There is too much to fit on a single device or server
Velocity means data is created and modified rapidly, and the system must respond to queries within milliseconds
Variety means data comes in many different forms, be it structured, unstructured, binary files, multimedia files, etc
Variety is actually the biggest issue, not volume. Big data's unstructured nature makes it difficult to analyse; machine learning is used to analyse big data and extract useful information
Relational databases are poorly suited to big data because they require data to fit a row-and-column format, and they do not scale well across multiple machines
Velocity is important in big data because some of it is time-sensitive, such as streamed data
When big data is too big for one machine, it must be distributed across multiple machines. These may not be in the same location or even country.
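As an illustration of spreading data across machines, here is a minimal Python sketch that assigns records to shards by hashing a key. The hashing scheme, the `shard_for` helper, and the record names are assumptions made for this example; the card itself does not say how the data is divided.

```python
import zlib

def shard_for(key, shard_count):
    # crc32 gives the same result on every run and every machine,
    # so any server can work out where a record lives.
    return zlib.crc32(key.encode("utf-8")) % shard_count

records = ["user:1", "user:2", "user:3", "user:4", "user:5", "user:6"]
shard_count = 3  # hypothetical number of machines

# Each list stands in for the storage of one machine.
shards = {i: [] for i in range(shard_count)}
for record in records:
    shards[shard_for(record, shard_count)].append(record)
```

Every record lands on exactly one shard, and the same key always maps to the same shard.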
Functional programming is one solution to the problems of big data, because it makes it easier to write correct code that can be distributed across multiple servers
Functional programs are stateless, meaning a function's result does not depend on how often it is called or in what order. Thus it is easier to write correct code and predict how it will behave
Higher-order functions are easily parallelised, meaning multiple processors can work on different parts of a data set at the same time
Functional languages use immutable data, meaning an object cannot be modified once created. (Everything is effectively a constant.) This makes parallel processing easy, because there are no side effects, so operations can happen in any order. Multiple functions can work on the same input without affecting the original
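A minimal Python sketch of the difference between stateful and stateless code. The function names are hypothetical, and Python is used only for illustration since it is not a purely functional language.

```python
# Stateful: the result depends on how many times the function has been called.
_calls = 0

def stateful_label(value):
    global _calls
    _calls += 1
    return f"{_calls}:{value}"

# Stateless (pure): the result depends only on the argument.
def pure_label(value):
    return f"item:{value}"

first = stateful_label("a")
second = stateful_label("a")   # same input, different output

same_1 = pure_label("a")
same_2 = pure_label("a")       # same input, same output on every call
```

Because `pure_label` gives the same answer regardless of call order, it can be run on any server at any time without coordination.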
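The idea can be sketched in Python with `map` and `reduce`. In this example a thread pool stands in for multiple processors or servers, which is an assumption for illustration; in practice the work would be spread over separate machines.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def square(x):
    # No side effects, so it is safe to apply to many elements at once.
    return x * x

data = list(range(1, 9))

# Sequential: map applies square to each element in turn.
sequential = list(map(square, data))

# Parallel: the same function is handed to a pool of workers.
# pool.map returns results in input order, whichever worker finishes first.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, data))

# reduce combines the mapped results into a single value.
total = reduce(lambda acc, x: acc + x, parallel, 0)
```

Both versions produce the same list, which is what makes swapping the sequential `map` for a parallel one safe.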
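Immutability can be imitated in Python with a frozen dataclass. This is a sketch only: the `Reading` class and both helper functions are hypothetical names invented for the example.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class Reading:
    sensor: str
    value: float

original = Reading("s1", 20.0)

def to_fahrenheit(r):
    # Returns a new Reading; the input is left untouched.
    return replace(r, value=r.value * 9 / 5 + 32)

def rounded(r):
    # A second function working on the same input.
    return replace(r, value=round(r.value))

fahrenheit = to_fahrenheit(original)
whole = rounded(original)

# Any attempt to modify the original raises an error.
try:
    original.value = 0.0
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Both functions read the same `original` object, and neither can change it, so the order in which they run does not matter.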
Big data can be represented with the fact-based model, where each piece of information is stored as a fact. A fact is immutable and cannot be overwritten or deleted.
Each fact has a timestamp attached, giving the date and time it was created. This can be compared against other facts - for example if a status needs to change from 'engaged' to 'married', both will exist as facts, but 'married' will have a later timestamp and thus be the fact that is used
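The 'engaged' to 'married' example might look like this in Python. The `Fact` class, its field names, the entity identifier, and the dates are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # frozen: a fact cannot be changed once created
class Fact:
    entity: str
    attribute: str
    value: str
    timestamp: datetime

# Both facts are kept; the newer one does not overwrite the older one.
facts = (
    Fact("person:1", "status", "engaged", datetime(2020, 2, 14, 9, 0)),
    Fact("person:1", "status", "married", datetime(2021, 6, 5, 14, 30)),
)

def current_value(facts, entity, attribute):
    # The fact with the latest timestamp is the one that is used.
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    return max(matching, key=lambda f: f.timestamp).value if matching else None

status = current_value(facts, "person:1", "status")
```

The older 'engaged' fact is still in the data set, so the history of the status remains available.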
Data is not indexed in the fact-based model; it is appended to the dataset
Errors are easy to correct in the fact-based model by simply returning to prior 'good' facts
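One possible reading of this, sketched in Python: a query can ignore an erroneous fact and fall back to the earlier good one. The tuple layout, the `without` helper, and the sample values are assumptions for illustration.

```python
from datetime import datetime

# Each fact is an immutable tuple: (entity, attribute, value, timestamp).
facts = (
    ("person:1", "city", "Leeds", datetime(2022, 1, 10)),
    ("person:1", "city", "Leds", datetime(2022, 3, 1)),  # erroneous fact
)

def latest(facts, entity, attribute):
    matching = [f for f in facts if f[0] == entity and f[1] == attribute]
    return max(matching, key=lambda f: f[3])[2] if matching else None

def without(facts, bad_fact):
    # Builds a new view that skips the bad fact; the stored facts are unchanged.
    return tuple(f for f in facts if f != bad_fact)

before = latest(facts, "person:1", "city")
good_facts = without(facts, facts[1])
after = latest(good_facts, "person:1", "city")
```

No earlier information was ever overwritten, so the prior good value is still there to return to.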
Facts are atomic, meaning they hold a single piece of information
We can represent big data visually using a graph schema. Nodes represent entities and contain their properties, and edges represent the relationships between entities
A graph schema rarely includes timestamps, so assume each node shows the most recent information
Properties may also be depicted as rectangles connected to nodes with dotted lines
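A graph schema can be sketched in code as a dictionary of nodes and a list of edges. The entity names, properties, and relationship labels below are invented for the example.

```python
# Nodes: each entity maps to its properties.
nodes = {
    "alice": {"type": "Person", "age": 34},
    "bob": {"type": "Person", "age": 36},
    "acme": {"type": "Company", "sector": "retail"},
}

# Edges: (source, relationship, destination).
edges = [
    ("alice", "married_to", "bob"),
    ("alice", "works_for", "acme"),
]

def related(source, relationship):
    # Follow every edge of the given kind that leaves the source node.
    return [dst for src, rel, dst in edges if src == source and rel == relationship]

employer = related("alice", "works_for")
employer_sector = nodes[employer[0]]["sector"]
```

Following an edge and then reading a property of the node it leads to is the basic way of answering a question from a graph.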