Big Data

Cards (20)

  • Big data is a catch-all term to describe vast amounts of data. It has three defining features, the 'three Vs': volume, velocity, and variety
  • Volume refers to the quantity of data. There is too much to fit on a single device or server
  • Velocity means data is created and modified rapidly, and queries against it must be answered within milliseconds
  • Variety means data comes in many different forms: structured, unstructured, binary files, multimedia files, etc
  • Variety is actually the biggest issue, not volume. Big data's unstructured nature makes it difficult to analyse; machine learning is used to analyse big data and extract useful information
  • Relational databases are poorly suited to big data: they require data to fit a row-and-column format, and they do not scale well across multiple machines
  • Velocity is important in big data because some of it is time-sensitive, such as streamed data
  • When big data is too big for one machine, it must be distributed across multiple machines. These may not be in the same location or even country
  • Functional programming is one solution to the problems of big data, because it makes it easier to write correct code that can be distributed across multiple servers
  • Functional programs are stateless, meaning a function's result does not depend on how often it is called or in what order. Thus it is easier to write correct code and predict how it will behave
  • Higher-order functions are easily parallelised, meaning multiple processors can work on different parts of a data set at the same time
  • Data in functional languages is immutable, meaning a value cannot be modified once created (everything is effectively a constant). This makes parallel processing easy: there are no side effects, so operations can run in any order, and multiple functions can work on the same input without affecting the original
  • Big data can be represented with the fact-based model, where each piece of information is stored as a fact. A fact is immutable and cannot be overwritten or deleted
  • Each fact has a timestamp attached, giving the date and time it was created. This can be compared against the timestamps of other facts. For example, if a status needs to change from 'engaged' to 'married', both will exist as facts, but 'married' will have a later timestamp and thus be the fact that is used
  • Data is not indexed in the fact-based model; it is appended to the dataset
  • Errors are easy to correct in the fact-based model: because earlier facts are never deleted, you can simply return to prior 'good' facts
  • Facts are atomic, meaning they hold a single piece of information
  • We can represent big data visually using a graph schema. Nodes represent entities and contain their properties, and edges represent the relationships between entities
  • A graph schema rarely includes timestamps, so assume each node shows the most recent information
  • Properties may also be depicted as rectangles connected to nodes with dotted lines
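
The fact-based model cards above can be illustrated with a short Python sketch. This is only one possible way to model it, not a standard implementation: the `Fact` fields (`entity`, `attribute`, `value`, `timestamp`) and the helper names `append_fact` and `current_value` are hypothetical, chosen for this example. Each fact is immutable and atomic, new facts are appended rather than overwriting old ones, and the fact with the latest timestamp is the one that is used.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen=True makes a fact immutable once created
class Fact:
    # All field names are assumptions made for this sketch.
    entity: str          # who or what the fact is about
    attribute: str       # the single property this fact records (atomic)
    value: str
    timestamp: datetime  # when the fact was created


def append_fact(facts: Tuple[Fact, ...], fact: Fact) -> Tuple[Fact, ...]:
    """Return a new dataset with the fact appended; the old one is unchanged."""
    return facts + (fact,)


def current_value(facts: Tuple[Fact, ...], entity: str, attribute: str,
                  as_of: Optional[datetime] = None) -> Optional[str]:
    """Return the value of the matching fact with the latest timestamp.

    If as_of is given, facts created after that moment are ignored, which is
    one way of "returning to prior good facts" after an error.
    """
    matching = filter(
        lambda f: f.entity == entity
        and f.attribute == attribute
        and (as_of is None or f.timestamp <= as_of),
        facts,
    )
    latest = max(matching, key=lambda f: f.timestamp, default=None)
    return latest.value if latest is not None else None


# Both statuses exist as facts; nothing is overwritten or deleted.
facts: Tuple[Fact, ...] = ()
facts = append_fact(facts, Fact("alice", "status", "engaged", datetime(2020, 1, 1)))
facts = append_fact(facts, Fact("alice", "status", "married", datetime(2021, 6, 1)))

latest_status = current_value(facts, "alice", "status")
earlier_status = current_value(facts, "alice", "status", as_of=datetime(2020, 12, 31))
```

Here `latest_status` is 'married' because that fact has the later timestamp, while `earlier_status` is 'engaged' because the later fact is ignored. The sketch uses `filter` and `max` with a key function, so no data is modified at any point, which is the property the functional programming cards rely on for safe parallel processing.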