Fundamentals

Cards (247)

  • Data science
    An evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. It adds methods from computer science to the repertoire of statistics.
  • Data scientist
    • Ability to work with big data
    • Experience in machine learning, computing, and algorithm building
    • Tools like Hadoop, Pig, Spark, R, Python, and Java
  • Big data
    A blanket term for any collection of data sets so large or complex that it becomes difficult to process using traditional data management techniques such as relational database management systems (RDBMS).
  • Characteristics of big data

    • Volume - How much data is there?
    • Variety - How diverse are different types of data?
    • Velocity - At what speed is new data generated?
  • The widely adopted RDBMS has long been regarded as a one-size-fits-all solution, but the demands of handling big data have shown otherwise.
  • Benefits and uses of data science and big data
    • Commercial companies use data science and big data to gain insights into customers, processes, staff, competition, and products
    • Companies use data science to offer customers better user experience, cross-sell, up-sell, and personalize offerings
    • Governmental organizations rely on internal data scientists to discover valuable information and share their data
  • Volume
    How much data is there?
  • Variety
    How diverse are different types of data?
  • Velocity
    At what speed is new data generated?
  • Benefits and uses of data science and big data
    • Commercial companies use data science and big data to gain insights into their customers, processes, staff, competition, and products
    • Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings
    • Governmental organizations are also aware of data's value. They rely on internal data scientists to discover valuable information, and they share their data with the public
    • Nongovernmental organizations (NGOs) use data to raise money and defend their causes
    • Universities use data science in their research and to enhance their students' study experience. The rise of massive open online courses (MOOCs) produces a lot of data, which lets universities study how this type of learning can complement traditional classes
  • Structured data

    Data that depends on a data model and resides in a fixed field within a record
  • Structured data is often easy to store in tables within databases or Excel files
  • SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases
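A minimal sketch of querying structured data with SQL, using Python's built-in sqlite3 module. The `customers` table, its columns, and its rows are made up for illustration and do not come from the cards above.

```python
import sqlite3

# In-memory database; the table and its rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Ada", "UK"), ("Linus", "FI"), ("Grace", "US"), ("Alan", "UK")],
)

# Every value sits in a fixed field, so counting customers per country
# is a single declarative query.
rows = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
print(rows)
conn.close()
```

Because the data follows a fixed model, the query names the fields it needs and the database does the rest.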
  • Categories of data
    • Structured
    • Unstructured
    • Natural language
    • Machine-generated
    • Graph-based
    • Audio, video, and images
    • Streaming
  • Unstructured data

    Data that isn't easy to fit into a data model because the content is context-specific or varying
  • Example of unstructured data
    • Regular email
  • Unstructured data in email
    • Contains structured elements like sender, title, and body text
    • But it is challenging to find the number of people who have written email complaints about a specific employee, because there are many ways to refer to a person
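The contrast between the structured and unstructured parts of an email can be sketched with Python's standard `email` module. The message, the addresses, and the employee name "John Smith" are invented for illustration.

```python
from email import message_from_string

# Hypothetical complaint email; all names and addresses are made up.
raw = """From: customer@example.com
To: support@example.com
Subject: Complaint

I spoke with John Smith yesterday. Mr. Smith was unhelpful, and J. Smith never called back.
"""

msg = message_from_string(raw)

# Structured elements: each header is a fixed field that can be read directly.
sender = msg["From"]
subject = msg["Subject"]

# Unstructured element: the body is free text.
body = msg.get_payload()

# A naive exact-match search finds only one of the three references
# to the same person ("John Smith", "Mr. Smith", "J. Smith").
exact_mentions = body.count("John Smith")
print(sender, subject, exact_mentions)
```

The headers come out with a simple lookup, while the body needs more than string matching to resolve the different ways of naming one person.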
  • Natural language
    A special type of unstructured data that is challenging to process as it requires knowledge of specific data science techniques and linguistics
  • Natural language processing
    • Has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis
    • But models trained in one domain don't generalize well to other domains
    • Even state-of-the-art techniques can't decipher the meaning of every piece of text
  • Machine-generated data
    Information automatically created by a computer, process, application, or other machine without human intervention
  • Analysis of machine data
    • Relies on highly scalable tools due to high volume and speed
    • Examples include web server logs, call detail records, network event logs, and telemetry
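  • Natural language processing has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don't generalize well to other domains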
  • Even state-of-the-art techniques aren't able to decipher the meaning of every piece of text
  • Machine-generated data
    Information that's automatically created by a computer, process, application, or other machine without human intervention
  • Machine-generated data is becoming a major data resource and will continue to be one
  • Analysis of machine data
    • Relies on highly scalable tools, due to its high volume and speed
  • Examples of machine data
    • web server logs
    • call detail records
    • network event logs
    • telemetry
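As an illustration of machine data, the sketch below parses a few web server log lines and counts HTTP status codes. It assumes the lines follow the Common Log Format; the sample lines and the `LOG_PATTERN` name are made up for this example.

```python
import re
from collections import Counter

# Regular expression for one Common Log Format line (assumed format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical sample lines; a real server writes these without human input.
log_lines = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing.html HTTP/1.0" 404 209',
    '192.168.0.7 - - [10/Oct/2000:13:56:01 -0700] "POST /login HTTP/1.0" 200 512',
]

# Count how often each status code occurs; lines that do not match are skipped.
status_counts = Counter()
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match:
        status_counts[match.group("status")] += 1

print(status_counts)
```

At real volumes the same per-line logic would run on a scalable tool rather than in a single loop, which is the point the cards make about machine data.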
  • Graph data
    Data that focuses on the relationship or adjacency of objects
  • Graph-based data

    • A natural way to represent social networks
    • Allows calculation of specific metrics such as the influence of a person and the shortest path between two people
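The two metrics mentioned above can be sketched on a tiny social network. This example uses breadth-first search for the shortest path and the number of direct connections (degree) as a very rough stand-in for influence. The people, the connections, and the choice of degree as the influence measure are assumptions made for illustration.

```python
from collections import deque

# Hypothetical undirected social network as an adjacency mapping.
graph = {
    "Ann": {"Bob", "Cara"},
    "Bob": {"Ann", "Dev"},
    "Cara": {"Ann", "Dev"},
    "Dev": {"Bob", "Cara", "Eli"},
    "Eli": {"Dev"},
}


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal using breadth-first search."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorting makes the traversal order deterministic.
        for neighbor in sorted(graph[node]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal is not reachable from start


path = shortest_path(graph, "Ann", "Eli")

# Degree: how many direct connections each person has.
degree = {person: len(friends) for person, friends in graph.items()}
most_connected = max(degree, key=degree.get)

print(path, most_connected)
```

Breadth-first search explores the network one hop at a time, so the first path that reaches the goal is a shortest one in an unweighted graph.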
  • Examples of graph-based data
    • LinkedIn - who you know at which company
    • Twitter - your follower list
  • Audio, image, and video are data types that pose specific challenges to a data scientist
  • Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers
  • LinkedIn
    You can see who you know at which company
  • Twitter follower list
    Another example of graph-based data
  • Graph-based data

    Data structured as a graph in the graph-theory sense: objects (nodes) connected by relationships (edges)
  • Graph theory
    The study of graphs, mathematical structures that model pairwise relationships between objects
  • MLBAM (Major League Baseball Advanced Media)
    • Announced in 2014 that it would increase video capture to approximately 7 TB per game for live, in-game analytics
  • High-speed cameras at stadiums will capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines
  • Recently a company called DeepMind succeeded at creating an algorithm that's capable of learning how to play video games
  • The algorithm takes the video screen as input and learns to interpret everything via a complex process of deep learning