Data must first be transformed (wrangled) before a company can use it properly. This usually involves converting raw data into a standardized format
The first step of data wrangling is to understand the data and what it is about - this is called discovery
Data wrangling can reduce algorithmic bias as it makes a dataset more accurate for its purposes
The second step of data wrangling is to structure the data, which makes it easier to access - this is called structuring
The third step of data wrangling is to remove biased or inaccurate data - this is called cleansing
The fourth step of data wrangling is to enrich the data with anything that will help it meet specified needs, because external data rarely has all the required parts readily available - this is called enrichment
The fifth step of data wrangling is to validate the dataset, which is where the data is checked for its reliability, quality, and safety. This often involves ensuring the data is complete and meets the requirements of each given field - this is called validation
The sixth step of data wrangling is to make the data available for use. This happens once the data is full and complete, both of which are confirmed when the data is assured - this is known as publishing
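The six steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the sample records, the `REGION_BY_CITY` lookup, and every function name are hypothetical and chosen to mirror the step names, not taken from any particular tool.

```python
# Hypothetical raw records with inconsistent field names, casing and a missing age.
RAW = [
    {"Name": " alice ", "Age": "34", "city": "Leeds"},
    {"Name": "BOB", "Age": "", "city": "york"},
    {"Name": "Cara", "Age": "29", "city": "Leeds"},
]

# Hypothetical external lookup used in the enrichment step.
REGION_BY_CITY = {"Leeds": "West Yorkshire", "York": "North Yorkshire"}


def discover(rows):
    """Discovery: find out which fields the raw data contains."""
    return sorted({key for row in rows for key in row})


def structure(rows):
    """Structuring: standardise field names and text formatting."""
    return [
        {
            "name": row["Name"].strip().title(),
            "age": row["Age"].strip(),
            "city": row["city"].strip().title(),
        }
        for row in rows
    ]


def cleanse(rows):
    """Cleansing: drop rows whose age is missing or not a whole number."""
    return [dict(row, age=int(row["age"])) for row in rows if row["age"].isdigit()]


def enrich(rows):
    """Enrichment: add a region field from the external lookup."""
    return [dict(row, region=REGION_BY_CITY.get(row["city"], "Unknown")) for row in rows]


def validate(rows):
    """Validation: check every row is complete and the age is plausible."""
    required = {"name", "age", "city", "region"}
    return all(required <= row.keys() and 0 < row["age"] <= 120 for row in rows)


def publish(rows):
    """Publishing: release the data only if it passed validation."""
    if not validate(rows):
        raise ValueError("dataset failed validation")
    return rows


fields = discover(RAW)
published = publish(enrich(cleanse(structure(RAW))))
```

In this sketch the row with the missing age is removed during cleansing, so only complete rows reach publishing.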
All data systems have the same core functions; different organisations use these functions in different ways depending on how the data will be processed and analysed, as well as what data will be used
The core functions of all data systems are:
Input
Search
Save
Integrate
Organise
Output
Feedback loop
For data systems what is input?
Collecting raw data
For data systems what is search?
Searches ensure data meets the needs of an organisation
For data systems what is save?
Storing data in a system to be used again
For data systems what is integrate?
Integrating different forms of data into a single location, allowing for a complete output
Database normalization is the process of organizing data so that it can be easily accessed and updated while minimizing redundancy.
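As a rough illustration of normalization reducing redundancy, the sketch below splits one flat table into two linked tables. The table contents and the names `FLAT_ORDERS`, `normalise`, and `email_for_order` are hypothetical, and plain Python structures stand in for real database tables.

```python
# Hypothetical flat table: the customer's email is repeated on every order row.
FLAT_ORDERS = [
    {"order_id": 1, "customer_id": 10, "customer_email": "a@example.com", "item": "Pen"},
    {"order_id": 2, "customer_id": 10, "customer_email": "a@example.com", "item": "Ink"},
    {"order_id": 3, "customer_id": 11, "customer_email": "b@example.com", "item": "Pad"},
]


def normalise(flat_rows):
    """Split the flat rows into a customers table and an orders table."""
    customers = {}  # customer_id -> customer details, stored once per customer
    orders = []     # each order keeps only a reference to its customer
    for row in flat_rows:
        customers[row["customer_id"]] = {"email": row["customer_email"]}
        orders.append(
            {
                "order_id": row["order_id"],
                "customer_id": row["customer_id"],
                "item": row["item"],
            }
        )
    return customers, orders


def email_for_order(customers, orders, order_id):
    """Look up the customer's email by following the order's customer_id."""
    for order in orders:
        if order["order_id"] == order_id:
            return customers[order["customer_id"]]["email"]
    raise KeyError(order_id)


customers, orders = normalise(FLAT_ORDERS)

# The email now lives in one place, so a single update reaches every order.
customers[10]["email"] = "new@example.com"
```

After the split, changing a customer's email is one update instead of one per order row, which is the redundancy the note refers to.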
Data warehousing is an approach to storing large amounts of structured data from multiple sources into one central repository
For data systems what is organise?
Organising and indexing saved data to ensure it meets the end user's requirements
For data systems what is output?
The processed and analysed data is sent to the relevant people
For data systems what is the feedback loop?
Measuring the outputs to evaluate the effectiveness of the process
Data has to be inputted into a digital system. This is often done by combining data stores. However, most digital data originates from a human manually inputting each value.
Two main types of error may occur when data is entered manually:
Transcription errors - when data is inputted with an incorrect character, for example from hitting two keys at once, such as Stuart being typed as Styart
Transposition errors - when characters are inputted in the wrong order, such as Stuart being typed as Staurt
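The difference between the two error types can be shown with a short check. This is a minimal sketch: `classify_entry_error` is a hypothetical helper that only recognises a single wrong character or a single swap of two neighbouring characters, and labels everything else "other".

```python
def classify_entry_error(source, typed):
    """Compare a typed value with its source and name the kind of error."""
    if source == typed:
        return "none"
    if len(source) == len(typed):
        # Positions where the typed value differs from the source.
        diffs = [i for i, (a, b) in enumerate(zip(source, typed)) if a != b]
        if len(diffs) == 1:
            # Exactly one wrong character.
            return "transcription"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and source[diffs[0]] == typed[diffs[1]]
            and source[diffs[1]] == typed[diffs[0]]
        ):
            # Two neighbouring characters swapped.
            return "transposition"
    return "other"


print(classify_entry_error("Stuart", "Styart"))  # transcription
print(classify_entry_error("Stuart", "Staurt"))  # transposition
```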
How are data entry errors reduced?
By validating and verifying inputted data
What is validation?
Checking data is suitable and meets pre-set rules
What is verification?
Seeing if data being entered into the digital system is identical to the source
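Verification can be sketched as a field-by-field comparison between the source and what was entered. The record layout and the name `verify_record` are assumptions made for this example.

```python
def verify_record(source, entered):
    """Return the names of fields whose entered value differs from the source."""
    return [field for field in source if entered.get(field) != source[field]]


# Hypothetical paper form (source) and the values typed into the system.
source_record = {"name": "Stuart", "postcode": "LS1 4AP"}
entered_record = {"name": "Staurt", "postcode": "LS1 4AP"}

mismatches = verify_record(source_record, entered_record)
```

An empty list means the entered record is identical to the source; here the list names the field that needs re-entering.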
Validation techniques can be used on any data entry, such as an online form, to reduce the risk of errors
It is good practice to pair validation with an error message that is shown if the inputted data is invalid
If data entered is incorrect this can lead to GIGO, which stands for garbage in, garbage out. This means that the processed data will be incorrect because the data inputted was incorrect
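A minimal sketch of validation paired with error messages is shown here. The form fields, the rules (presence, type, range and format checks), the simplified `EMAIL_PATTERN`, and the wording of the messages are all assumptions chosen for illustration.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_form(form):
    """Check each field against a pre-set rule; return field -> error message."""
    errors = {}

    # Presence check: the name must not be blank.
    if not form.get("name", "").strip():
        errors["name"] = "Name is required."

    # Type check, then range check, on the age.
    age = form.get("age", "").strip()
    if not age.isdigit():
        errors["age"] = "Age must be a whole number."
    elif not 0 < int(age) <= 120:
        errors["age"] = "Age must be between 1 and 120."

    # Format check on the email address.
    if not EMAIL_PATTERN.match(form.get("email", "").strip()):
        errors["email"] = "Email must look like name@example.com."

    return errors


good = validate_form({"name": "Stuart", "age": "34", "email": "stuart@example.com"})
bad = validate_form({"name": "", "age": "abc", "email": "nope"})
```

An empty result means every rule passed. Note that validation only shows the data is suitable, not that it is correct: a wrong but plausible age would still pass.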
When data needs to be entered for a large industry, data-entry screens are developed
Data-entry screens must be made suitable for the industry, which requires the developer to understand the needs of that industry.
For example, data entered on data-entry screens often needs to be formatted correctly, so rules must be produced to avoid erroneous data entry
When data has been entered into digital systems it must be maintained, and this can be done in various ways:
Carrying out regular scheduled searches to remove redundant or expired data
Regularly updating data that may change over time, such as a mailing list
A company must, legally, be able to maintain data. Whenever a user requests that their data be removed, the company must oblige (right to be forgotten). Therefore it must be able to find the user's data and remove it, which is a form of maintenance.
Additionally, other data subject needs must be met, such as the right to rectify data, for example if information about them changes
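The maintenance tasks above can be sketched as three small operations on stored records. The record layout (`user_id`, `email`, `expires`) and the function names are hypothetical; a real system would run these against a database rather than a list.

```python
from datetime import date

# Hypothetical stored records, each with an expiry date.
records = [
    {"user_id": 1, "email": "a@example.com", "expires": date(2030, 1, 1)},
    {"user_id": 2, "email": "b@example.com", "expires": date(2020, 1, 1)},
    {"user_id": 3, "email": "c@example.com", "expires": date(2030, 6, 1)},
]


def remove_expired(rows, today):
    """Scheduled search: keep only records that have not yet expired."""
    return [row for row in rows if row["expires"] >= today]


def erase_user(rows, user_id):
    """Right to be forgotten: find and remove all of one user's records."""
    return [row for row in rows if row["user_id"] != user_id]


def rectify_user(rows, user_id, **changes):
    """Right to rectify: update the named fields on one user's records."""
    return [dict(row, **changes) if row["user_id"] == user_id else row for row in rows]


current = remove_expired(records, today=date(2025, 1, 1))
current = erase_user(current, user_id=3)
current = rectify_user(current, user_id=1, email="new@example.com")
```

Each function returns a new list and leaves the original untouched, which keeps the sketch easy to follow.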
Once data has been inputted, processed and analysed it has to be output (presented) in a format that makes the data helpful to the end users. The four main ways information is presented are:
Graphs/charts
Data tables
Reports
Infographics
Graphs and charts are best used to present numerical data. It is possible for the end user to misinterpret them, so their design must be considered
Data tables are useful when data is related, such as the percentage increase in train users across different regions of England
Data tables work best with small amounts of data, as this makes them easy to interpret. They must be properly labelled and are often coloured to convey a message.
Reports are generally written information regarding data patterns and trends. They are often used when the end user is presenting information to those who may not have context, as a report allows the user to curate what they wish to present
Infographics are a great way of making data more memorable and more comparable to the real world. They often cover an entire topic, giving a general idea rather than much detail
All data that is gathered, processed, and analysed must have a high level of reliability and quality
Data assurance checks that data is reliable and of good quality