An evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. It adds methods from computer science to the repertoire of statistics.
Data scientist
Ability to work with big data
Experience in machine learning, computing, and algorithm building
Tools like Hadoop, Pig, Spark, R, Python, and Java
Big data
A blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques like RDBMS (relational database management systems)
Characteristics of big data
Volume - How much data is there?
Variety - How diverse are different types of data?
Velocity - At what speed is new data generated?
The widely adopted RDBMS has long been regarded as a one-size-fits-all solution, but the demands of handling big data have shown otherwise.
Benefits and uses of data science and big data
Commercial companies use data science and big data to gain insights into their customers, processes, staff, competition, and products
Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings
Governmental organizations are also aware of data's value: they rely on internal data scientists to discover valuable information and share their data with the public
Nongovernmental organizations (NGOs) use data to raise money and defend their causes
Universities use data science in their research and to enhance their students' study experience; the rise of massive open online courses (MOOCs) produces a lot of data, which lets universities study how this type of learning can complement traditional classes
Structured data
Data that depends on a data model and resides in a fixed field within a record
Structured data is often easy to store in tables within databases or Excel files
SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases
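A minimal sketch of querying structured data with SQL, using Python's built-in sqlite3 module. The customers table, its columns, and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database; the table and its contents are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT, age INTEGER)"
)
rows = [(1, "Alice", "NL", 34), (2, "Bob", "BE", 45), (3, "Carol", "NL", 29)]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Every record has the same fixed fields, so one declarative query
# can group and aggregate the whole table.
result = conn.execute(
    "SELECT country, COUNT(*), AVG(age) FROM customers "
    "GROUP BY country ORDER BY country"
).fetchall()
print(result)
conn.close()
```

Because each value sits in a known field, the query needs no knowledge of what the text inside a record means.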
Categories of data
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Unstructured data
Data that isn't easy to fit into a data model because the content is context-specific or varying
Example of unstructured data
Regular email
Unstructured data in email
Contains structured elements like sender, title, and body text
But it is challenging to count how many people wrote email complaints about a specific employee, because there are many ways to refer to a person
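The email case can be sketched with Python's standard email module. The message, the addresses, and the employee name are invented for illustration; the exact-match search is a deliberately naive approach, not a recommended one.

```python
from email import message_from_string

# Hypothetical raw message: headers, a blank line, then free-text body.
raw = """From: customer@example.com
To: support@example.com
Subject: Complaint about my last visit

Mr. J. Smith ignored my request yesterday.
"""

msg = message_from_string(raw)
sender = msg["From"]       # structured element: easy to extract
subject = msg["Subject"]   # structured element: easy to extract
body = msg.get_payload()   # unstructured element: free text

# Hypothetical complaint bodies that all refer to the same employee.
bodies = [
    "John Smith was rude to me.",
    "Mr. J. Smith ignored my request.",
    "Johnny at the front desk was unhelpful.",
]

# Naive exact matching finds only one of the three complaints.
naive_hits = [b for b in bodies if "John Smith" in b]
print(sender, subject, len(naive_hits))
```

The headers parse cleanly, while the body needs more than string matching to resolve the different ways of naming one person.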
Natural language
A special type of unstructured data that is challenging to process as it requires knowledge of specific data science techniques and linguistics
Natural language processing
Has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis
But models trained in one domain don't generalize well to other domains
Even state-of-the-art techniques can't decipher the meaning of every piece of text
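The domain problem can be illustrated with a toy word-list sentiment scorer. This is a hand-written stand-in for a trained model, and the word lists and sentences are assumptions made up for this sketch.

```python
# Hypothetical word lists tuned to product reviews.
POSITIVE = {"great", "love", "positive", "excellent"}
NEGATIVE = {"bad", "broken", "terrible", "negative"}

def sentiment(text):
    """Label text by counting words from the two lists."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = sentiment("I love this phone, the camera is excellent!")
medical = sentiment("The biopsy came back positive.")
print(review, medical)
```

The scorer labels both sentences as positive, although in a medical context a positive biopsy is usually bad news. Rules or models built for one domain can misread another.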
Machine-generated data
Information that's automatically created by a computer, process, application, or other machine without human intervention
Machine-generated data is becoming a major data resource and will continue to be one
Analysis of machine data
Relies on highly scalable tools, due to its high volume and speed
Examples of machine data
web server logs
call detail records
network event logs
telemetry
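As a small illustration of analyzing machine data, the sketch below parses web server log lines with a regular expression and counts HTTP status codes. The log lines are invented and assume a Common Log Format style layout; real logs may differ, and at high volume this would run on a scalable tool rather than a single script.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical log lines.
lines = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.1 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 200 512',
]

def parse(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

records = [r for r in (parse(line) for line in lines) if r is not None]
status_counts = Counter(r["status"] for r in records)
print(status_counts)
```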
Graph data
Data that focuses on the relationship or adjacency of objects
Graph-based data
A natural way to represent social networks
Allows calculation of specific metrics such as the influence of a person and the shortest path between two people
Examples of graph-based data
LinkedIn - who you know at which company
Twitter - your follower list
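The two metrics mentioned above can be sketched on a tiny, made-up social network stored as an adjacency list. Breadth-first search gives a shortest path between two people, and the number of connections (degree) serves here as a crude stand-in for influence. The names and links are hypothetical.

```python
from collections import deque

# Hypothetical undirected network: person -> list of direct connections.
network = {
    "Ann": ["Ben", "Cara"],
    "Ben": ["Ann", "Dev"],
    "Cara": ["Ann", "Dev"],
    "Dev": ["Ben", "Cara", "Eli"],
    "Eli": ["Dev"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        person = path[-1]
        if person == goal:
            return path
        for friend in graph.get(person, []):
            if friend not in visited:
                visited.add(friend)
                queue.append(path + [friend])
    return None

path = shortest_path(network, "Ann", "Eli")

# Degree: how many direct connections each person has.
degree = {person: len(friends) for person, friends in network.items()}
most_connected = max(degree, key=degree.get)
print(path, most_connected)
```

Degree is only one possible notion of influence; other centrality measures exist and may rank people differently.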
Audio, image, and video are data types that pose specific challenges to a data scientist
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers
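Graph-based data
Data whose structure is a graph in the mathematical sense: nodes (objects) connected by edges (relationships)
Graph theory
The mathematical study of graphs, structures that model pairwise relationships between objects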
MLBAM (Major League Baseball Advanced Media)
Announced in 2014 that it would increase video capture to approximately 7 TB per game for live, in-game analytics
High-speed cameras at stadiums will capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines
Recently a company called DeepMind succeeded at creating an algorithm that's capable of learning how to play video games
The algorithm takes the video screen as input and learns to interpret everything via a complex process of deep learning