4.5.2 Big data software

Cards (8)

  • Big data software

    Software developed to enable big data tasks, often requiring clusters
  • MapReduce
    • Map procedure performs a task on each chunk of data, such as sorting it
    • Reduce procedure summarises the outputs from the mapping
    • Framework divides the data, sending chunks to each node and receiving their results
    • Framework ensures work is shared fairly between nodes and detects faults
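    The map/shuffle/reduce pattern above can be sketched in miniature. This is a single-machine toy, not a real distributed framework: the word-count task, the sample chunks and the function names are all illustrative.

    ```python
    from collections import defaultdict

    # Map: emit (key, value) pairs from one chunk of input
    def map_chunk(chunk):
        return [(word, 1) for word in chunk.split()]

    # Shuffle: group all values by key, as the framework would do across nodes
    def shuffle(pairs):
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    # Reduce: summarise the values collected for each key
    def reduce_group(key, values):
        return key, sum(values)

    chunks = ["the cat sat", "the dog sat"]   # stands in for data split between nodes
    pairs = [p for c in chunks for p in map_chunk(c)]
    counts = dict(reduce_group(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
    ```

    In a real framework each `map_chunk` call would run on a different node, and the shuffle would move data over the network; only the overall shape is the same.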
  • Google formerly used MapReduce to implement PageRank, the method by which its web search engine ranks web pages so that the most useful results tend to appear near the top of the first page
  • PageRank
    Determines the usefulness of a web page based on the number of other pages that link to it
  • Googlebot data processing

    1. Googlebot 'crawler' programs visit billions of websites, recording the destinations of every link on every page
    2. Blocks of Googlebot data arrive and are broken into equally-sized chunks
    3. Framework distributes multiple copies of every block amongst nodes in the cluster
    4. Map procedure compiles long lists of every link on every page in the data block
    5. Shuffling sorts the list of links into alphabetical order of the destination page's address
    6. Reducing calculates the total number of links pointing to a web page
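    Steps 4 to 6 above can be sketched as a toy link-counting job on one machine (the page addresses and crawl data are made up for illustration):

    ```python
    from itertools import groupby

    # Step 4 (map): list the destination of every link on a page
    def map_links(page_links):
        return [dest for dest in page_links]

    # crawler output: page -> destinations of the links it contains (illustrative)
    crawl = {
        "a.example": ["c.example", "b.example"],
        "b.example": ["c.example"],
    }

    links = [d for page in crawl.values() for d in map_links(page)]
    # Step 5 (shuffle): sort the links by destination address
    links.sort()
    # Step 6 (reduce): total the links pointing at each destination page
    inbound = {dest: len(list(group)) for dest, group in groupby(links)}
    print(inbound)  # {'b.example': 1, 'c.example': 2}
    ```

    Pages with more inbound links end up with higher totals, which is the raw signal PageRank builds on.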
  • Google no longer uses MapReduce for this purpose, but it is still used by many other major companies and science projects to make sense of big data
  • Hadoop
    Free software package that allows anyone to run their own cluster; used for data storage on the internet (cloud storage) and for providing computer time to run programs
  • The Hadoop project author, Doug Cutting, named the program after a cuddly yellow toy elephant belonging to his son