(1.4) Big Data

  • Like the explosion of interest in analytics, interest in what is known as big data has recently increased dramatically.
  • Big data is simply a set of data that cannot be managed, processed, or analyzed with commonly available software in a reasonable amount of time.
  • Examples of big data include:
    • Walmart handles over one million purchase transactions per hour.
    • Facebook processes more than 250 million picture uploads per day.
    • Five billion cell-phone owners around the world generate vast amounts of data by calling, texting, tweeting and browsing the web on a daily basis.
  • As Google CEO Eric Schmidt has noted, the amount of data currently created every 48 hours is equivalent to the entire amount of data created from the dawn of civilization until the year 2003. Perhaps it is not surprising that 90 percent of the data in the world today has been created in the last two years.
  • Businesses are interested in understanding and using data to gain a competitive advantage.
  • Although big data presents opportunities, it also poses analytical challenges from a processing point of view, which has itself led to an increase in the use of analytics.
  • More companies are hiring data scientists who know how to process and analyze massive amounts of data.
  • Big data issues are a subset of analytics, and many very valuable applications of analytics do not involve big data.
  • It is through technology that we have truly been thrust into the data age.
  • Because data can now be collected electronically, the available amounts of it are staggering.
  • The term "big data" arose amid the vast amounts of data being collected from sources such as the Internet, cell phones, retail checkout scanners, surveillance video, and sensors.
  • There is no universally accepted definition of big data, but a commonly accepted one is that it refers to any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software.
  • IBM describes big data through the four Vs:
    • Volume
    • Velocity
    • Variety
    • Veracity
  • Volume
    • Data at rest
    • Terabytes to exabytes of existing data to process
  • Velocity
    • Data in Motion
    • Streaming data, milliseconds to seconds to respond
  • Variety
    • Data in Many Forms
    • Structured, unstructured, text, multimedia
  • Veracity
    • Data in Doubt
    • Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, model approximations
  • Volume:
    • Electronic data collection makes it possible to gather vast quantities of data
    • Many companies now store over 100 terabytes of data (1 terabyte = 1,024 gigabytes)
  • Velocity:
    • Real-time capture and analysis of data pose challenges in storage and speed of analysis
    • The New York Stock Exchange collects 1 terabyte of data in a single trading session
    • Current data and real-time rules for trades and predictive modeling are crucial for managing stock portfolios
  • Variety:
    • In addition to handling large volumes at high speeds, companies now collect more complicated types of data
    • Text data from social media platforms like Twitter, audio data from service calls, and video data from in-store cameras are examples
    • Analyzing nontraditional data sources is complex due to the processing needed to transform data into a numerical form for analysis
  • Veracity:
    • Refers to the uncertainty in data
    • Challenges include missing values, inconsistencies in units of measure, and unreliable responses that can introduce bias
    • Ensuring reliable analysis with uncertain data is a significant challenge
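
The veracity challenges listed above (missing values and inconsistent units of measure) can be illustrated with a small sketch. Everything in it is an assumption made for illustration and does not come from the source: the sample records, the `TO_KG` conversion table, and the helper names `to_kilograms` and `summarize` are hypothetical. It shows one possible way to put values on a common unit and to report how many records had to be left out, not a definitive cleaning procedure.

```python
from statistics import mean

# Hypothetical records: weights reported in mixed units, one value missing.
records = [
    {"id": 1, "weight": 2.0, "unit": "kg"},
    {"id": 2, "weight": 500.0, "unit": "g"},
    {"id": 3, "weight": None, "unit": "kg"},
    {"id": 4, "weight": 1.5, "unit": "kg"},
]

# Assumed conversion factors to a single unit of measure (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001}


def to_kilograms(record):
    """Return the weight in kg, or None if the value is missing or the unit is unknown."""
    weight = record.get("weight")
    factor = TO_KG.get(record.get("unit"))
    if weight is None or factor is None:
        return None
    return weight * factor


def summarize(rows):
    """Convert every row to kg and report how many rows could not be used."""
    converted = [to_kilograms(r) for r in rows]
    usable = [w for w in converted if w is not None]
    return {
        "n_total": len(rows),
        "n_missing": len(converted) - len(usable),
        "mean_kg": mean(usable) if usable else None,
    }


summary = summarize(records)
print(summary)
```

Reporting the number of unusable records alongside the mean is a deliberate choice here: silently dropping uncertain data hides how much of the analysis rests on incomplete information.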