Data Dictionary


A

Access control

Features such as password protection, user permissions, and encryption that are used to protect a spreadsheet.

Action-oriented Question

A question whose answers lead to change.

Administrative metadata

Metadata that indicates the technical source of a digital asset.

Agenda

A list of scheduled appointments.

Algorithm

A process or set of rules followed for a specific task.

Analytical Skills

Qualities and characteristics associated with using facts to solve problems.

Analytical Thinking

The process of identifying and defining a problem, then solving it by using data in an organized, step-by-step manner.

Attribute

A characteristic or quality of data used to label a column in a table.

Audio file

Digitized audio storage, usually in an MP3, AAC or other compressed format.

AVERAGE

A spreadsheet function that returns an average of the values from a selected range.

B

Bad Data Source

A data source that is not reliable, original, comprehensive, current, or cited (ROCCC).

Bias

A conscious or unconscious preference in favor of or against a person, group of people, or thing.

Big Data

Large, complex datasets typically involving long periods of time, which enable data analysts to address far-reaching business problems.

Boolean data

A data type with only 2 possible values, usually true or false.

Borders

Lines that can be added around two or more cells on a spreadsheet.

Business task

The question or issue that data analysis resolves for a business.

C

CASE

The CASE statement goes through one or more conditions and returns a value as soon as a condition is met.

SELECT
    [COLUMN],
    CASE
        WHEN [COLUMN] = '[ORIGINAL VALUE]' THEN '[EXPECTED VALUE]'
        WHEN [COLUMN] = '[ORIGINAL VALUE]' THEN '[EXPECTED VALUE]'
        WHEN [COLUMN] = '[ORIGINAL VALUE]' THEN '[EXPECTED VALUE]'
        ELSE [COLUMN]
    END AS [COLUMN]
FROM
    [DATABASE.TABLE]

CAST()

A SQL function that converts a value from one data type to another, for example from a string to a date or a number.

Causation

The idea that an event leads to a specific outcome.

Cell Reference

A cell or range of cells in a worksheet, typically used in formulas and functions.

Change Log

A file containing a chronologically ordered list of modifications made to a project.

Change Log - Code Commit

The analyst submits the updated version of the query to a repository in the company's version control system. Document exactly what the change was, and why it was made.

Change Log - Syncing

An analyst makes sure the most up-to-date version of the query is the one they will change.

Change Log - Code Review

The analyst may ask someone to review the change, which is a code review, and this can be formal or informal.

Charts

Choosing a Chart

Evaluate the data, then choose a chart type that matches the pattern(s) you find:

Change

This is a trend or instance of observations that become different over time. A great way to measure change in data is through a line or column chart.

Clustering

A collection of data points with similar or different values. This is best represented through a distribution graph.

Relativity

These are observations considered in relation or proportionally to something else. You have probably seen examples of relativity data in a pie chart.

Ranking

This is a position in a scale of achievement or status. Data that requires ranking is best represented by a column chart.

Correlation

This shows a mutual relationship or connection between two or more things. A scatter plot is an excellent way to represent this type of data pattern.

Types of Charts

Column

Column charts use size to contrast and compare two or more values, using heights or lengths to represent the specific values.

Distribution graph

Displays the spread of various outcomes in a dataset.

Heatmap

Similar to bar charts, heatmaps also use color to compare categories in a dataset. They are mainly used to show relationships between two variables and use a system of color-coding to represent different values. For example, a heatmap could plot temperature changes for each city during the hottest and coldest months of the year.

Histogram

Shows how often data values fall into certain ranges.

Line

A line chart is used to track changes over short and long periods of time. When smaller changes exist, line charts are better to use than bar graphs. Line charts can also be used to compare changes over the same period of time for more than one group.

Pie

A circular graph divided into segments, where each segment represents a proportion of the whole; especially useful when dealing with parts of a whole.

Scatter plot

Shows relationships between different variables. Scatter plots are typically used for two variables for a set of data, although additional variables can be displayed.

Reference - Charts

Google Charts - Page with examples, definitions and other information.

Clean Data

Data that is complete, correct, and relevant to the problem you're trying to solve.

Cloud

A place to keep data online, rather than on a computer hard drive.

Compatibility

How well two or more datasets can work together.

Concatenate

A function that joins multiple text strings into a single string.

Conditional Formatting

A spreadsheet tool that changes how cells appear when values meet specific conditions.

Confirmation Bias

The tendency to search for or interpret information in a way that confirms pre-existing beliefs.

Confidence Level

The probability that your sample accurately reflects the greater population.

Having a 99% confidence level is ideal, but most industries aim for at least a 90% to 95% confidence level.

Consent

The aspect of ethics that presumes an individual's right to know how and why their personal data will be used before agreeing to provide it.

Context

The condition in which something exists or happens.

Continuous Data

Data that is measured and can have almost any numerical value.

Correlation

In statistics, the measure of the degree to which two variables move in relation to each other.

If one variable goes up and the other variable also goes up, it is a positive correlation. If one variable goes up and the other variable goes down, it is a negative or inverse correlation. If one variable goes up and the other variable stays about the same, there is no correlation.

Remember

  • Critically analyze any correlations that you find.

  • Examine the data’s context to determine if a causation makes sense (and can be supported by all the data).

  • Understand the limitations of the tools that you use for analysis.

Cookie

A small file stored on a computer that contains information about its users.

COUNT

Returns the number of numeric values in a dataset (Google Sheets).

Counts the numerical values within a specified range (Coursera).

COUNTA

A function that counts the total number of values within a specified range.

Returns the number of values in a dataset (Google)

CSV (comma separated value)

A delimited text file that uses a comma to separate values.

COUNTIF - Function

A function that returns the number of cells that match a specified value.

Counts the number of times a value occurs in a range of cells.

Currency

The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions.

D

Dashboard

A tool that monitors live, incoming data.

Data

A collection of facts.

Principles of Data Integrity

Accuracy

The degree of conformity of a measure to a standard or a true value.

Completeness

The degree to which all required measures are known.

Consistency

The degree to which a set of measures is equivalent across systems.

Validity

The concept of using data integrity principles to ensure measures conform to defined business rules or constraints.

Decision Trees

A decision-making tool that allows you, the data analyst, to make decisions based on key questions that you ask yourself. Each question in the visualization decision tree helps you decide on critical features for your visualization.

Begin with your story

Start by evaluating the type of data and ask a series of questions:

Does your data have only one numeric variable?

If your data has a single, continuous, numerical variable, then a histogram or density plot is usually the best way to plot it. Depending on your data, a bar chart can also be appropriate in this case.

Are there multiple datasets?

For cases dealing with more than one set of data, consider a line or pie chart for accurate representation of your data. A line chart will connect multiple data sets over a single, continuous line, showing how numbers have changed over time. A pie chart is good for dividing a whole into multiple categories or parts.

Are you measuring changes over time?

A line chart is usually adequate for plotting trends over time. However, when the changes are larger, a bar chart is the better option.

Do relationships between the data need to be shown?

When you have two variables for one set of data, it is important to point out how one affects the other. Variables that pair well together are best plotted on a scatter plot. However, if there are too many data points, the relationship between variables can be obscured, so a heat map can be a better representation in that case. For example, if you are measuring the population across all 50 states in the United States, you would have millions of data points, so you would use a heat map.

Reference - Charts

Choosing the right type of chart - YouTube

From Data to Viz - Note Copyright is dated 2018.

Common Data Errors

Human error in data entry.

Flawed processes.

System issues.

Data Analysis

The collection, transformation, and organization of data to draw conclusions, make predictions, and drive informed decision-making.

Data Analysis Process

The six phases of ask, prepare, process, analyze, share and act whose purpose is to gain insights that drive informed decision-making.

Reference - Data Analysis

What is Data Analysis: Methods, Process and Types Explained - SimpliLearn

What Is the Data Analysis Process? 5 Key Steps to Follow - G2

Data analysis - Wikipedia

10 Google Workspace tips to analyze data - Google

Data Analyst

Someone who collects, transforms, and organizes data to draw conclusions, make predictions and drive informed decision-making.

Data Analytics

The science of data.

Data Anonymization

The process of protecting people's private or sensitive data by eliminating identifying information.

Data Bias

When a preference in favor or against a person, group of people, or thing systematically skews data analysis results in a certain direction.

Data Cleaning

Overview

Document Errors

Back up the data before cleaning.

Address / fix the source of the error.

Analyze the system before cleaning the data.

Understand the root cause for dirty data & understand where the errors came from.

Keep business objectives in mind.

Check for spelling errors.

Check for misfielded values (a misfielded value is one entered into the wrong field).

Check for missing values.

Include timeline to clean the data in the initial deadline / process.

Look at the whole picture

Data Cleaning Process

Confirm and change Text as Lower / Upper / Proper Case.

Remove blank Cells.

Remove formatting.

Remove the extra spaces via trimming the whitespace.

Transpose the data from long to wide.

Use pivot tables and charts to visualize and look for errors.
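
As an illustration of several of the steps above, here is a minimal pandas sketch; the DataFrame, column names, and values are invented for the example and do not come from any particular dataset.

import pandas as pd

# Hypothetical raw data with casing, whitespace, blank-value, and duplicate problems.
df = pd.DataFrame({
    "name": ["  alice ", "BOB", "Carol", None, "BOB"],
    "state": ["ny", "NY ", " ca", "CA", "NY"],
})

# Confirm and change text case; trim the extra whitespace.
df["name"] = df["name"].str.strip().str.title()
df["state"] = df["state"].str.strip().str.upper()

# Remove blank cells and duplicate records.
df = df.dropna().drop_duplicates()

# A pivot-table-style count per state helps spot unexpected or misspelled values.
print(pd.pivot_table(df, index="state", values="name", aggfunc="count"))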

Pitfalls

Forgetting to document errors.

Looking at a subset of data and not the whole picture.

Losing track of the business objective(s).

Not accounting for data cleansing in your deadlines / process.

Not analyzing the system before data cleaning.

Not backing up the data before data cleansing.

Not checking for misfielded values.

Not checking for spelling errors.

Not fixing the source of the error(s).

Overlooking missing values.

Questions to Ask when importing / merging data from multiple sources:

Do I have all the data I need?

Does the data exist within these datasets?

Does the data need to be cleaned, or is it ready to be used?

Are the data sets cleaned to the same standard?

Data Cleaning Tools

Data Validation

Conditional Formatting

COUNTIF

Sorting

Filtering

Reference - Cleaning Data

Top ten ways to clean your data - Microsoft

10 Google Workspace tips to clean up data - Google

Data Compatibility

How well two or more datasets can work together.

Data Constraints

Accuracy

Degree to which the data conforms to the actual entity being measured or described.

Completeness

Degree to which the data contains all desired components or measures.

Consistency

Degree to which the data is repeatable from different points of entry or collection.

Cross-field validation

Certain conditions for multiple fields must be satisfied.

Data range

Values must fall between predefined maximum and minimum values.

Data type

Values must be of a certain type: date, number, percentage, Boolean, etc.

Foreign key

Databases only: Values for a column must match values that come from a column in another table.

Mandatory

Values can't be left blank or empty.

Primary key

Databases only: Values must be unique within the column and cannot be null.

Regular expression (regex) patterns

Values must match a prescribed pattern.

Set-membership

Databases only: Values for a column must come from a set of discrete values.

Unique

Values can't have a duplicate.
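
Several of the constraints above can be spot-checked in code before data is loaded. A rough pandas sketch, with an invented table and invented rules, purely for illustration:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "email": ["a@example.com", "b@example.com", "not-an-email", None],
    "score": [88, 105, 42, 77],
})

# Mandatory: values can't be left blank or empty.
missing_email = df["email"].isna()

# Data range: values must fall between predefined minimum and maximum values.
out_of_range = ~df["score"].between(0, 100)

# Unique: values can't have a duplicate.
duplicate_id = df["id"].duplicated(keep=False)

# Regular expression (regex) pattern: values must match a prescribed pattern.
bad_pattern = ~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Rows violating at least one constraint.
print(df[missing_email | out_of_range | duplicate_id | bad_pattern])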

Data Design

How information is organized.

Data-Driven decision-making

Using facts to guide business strategy.

Data Ecosystem

The various elements that interact with one another to produce, manage, store, organize, analyze, and share data.

Data Element

A piece of information in a dataset.

Data Engineers

Transform data into a useful format for analysis and give it a reliable infrastructure.

Data Ethics

Well-founded standards of right & wrong that dictate how data is collected, shared, and used.

Data Governance

A process for ensuring the formal management of a company's data assets.

Data-inspired decision-making

Exploring different data sources to find out what they have in common.

Data: Insufficient Data - Types of

Data from only one source

Data that keeps updating and is incomplete

Outdated Data

Do the analysis using proxy data along with actual data.

Adjust your analysis to align with the data you already have.

Geographically limited data

No data

Gather the data on a small scale to perform a preliminary analysis, and then request additional time to complete the analysis after you have collected more data.

If there isn’t time to collect data, perform the analysis using proxy data from other datasets—the most common workaround.

Too Little Data

Do the analysis using proxy data along with the actual data.

Adjust the analysis to align with the data you already have.

Wrong Data

  • If you have the wrong data because requirements were misunderstood, communicate the requirements again.

  • Identify errors in the data and, if possible, correct them at the source by searching for a pattern in the errors.

  • If you can’t correct data errors yourself, you can ignore the wrong data and proceed with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias.

Possible workarounds

  • Identify trends with the available data.

  • Wait for more data if time allows

  • Discuss with stakeholders and adjust your objective.

  • Search for a new dataset.

Data Integrity

The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.

Data Interoperability

The ability to integrate data from multiple sources and a key factor in the successful use of open data among companies and governments.

Data Life Cycle

The sequence of stages that data experiences, which include plan, capture, manage, analyze, archive, and destroy.

Long Data

Values that repeat in the first column.

Data in which each row is one time point per subject, so each subject will have data in multiple rows.

Instead of having a column for each year (or other data point), there is a separate row per subject for (as an example) each year.
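
A small pandas sketch of the difference, using invented store sales figures: melt() reshapes wide data (one column per year) into long data (one row per store per year), and pivot() reshapes it back.

import pandas as pd

# Wide format: one row per store, one column per year.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "2019": [100, 150],
    "2020": [120, 160],
})

# Long format: one row per store per year.
long_df = wide.melt(id_vars="store", var_name="year", value_name="sales")
print(long_df)

# Reshape back to wide format.
print(long_df.pivot(index="store", columns="year", values="sales"))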

Data Manipulation

The process of changing data to make it more organized and easier to read.

Data Mapping

The process of matching fields from one data source to another.

Data Model

A tool for organizing data elements and how they relate to one another.

Data Privacy

Preserving a data subject's information any time a data transaction occurs.

Data Remove Duplicates

A tool that automatically searches for and eliminates duplicate entries from a spreadsheet.

Data Replication

The process of storing data in multiple locations.

Data Science

A field of study that uses raw data to create new ways of modeling and understanding the unknown.

Data Security

Protecting data from unauthorized access or corruption by adopting safety measures.

Data Strategy

The management of the people, processes, and tools used in data analyses.

Data Transfer

The process of copying data from a storage device to memory, or from one computer to another.

Data Type

An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform.

Data Validation

A tool for checking the accuracy and quality of data before adding or importing it.

Data Visualization

The graphical representation of data.

Junk Charts Trifecta Checkup: The Definitive Guide

What is the practical QUESTION?

What does the DATA say?

What does the VISUAL say?

The Trifecta Checkup framework establishes a taxonomy of data viz critique. There are eight types of critiques: each is presented with an icon, and accompanied by an example from a prior post on Junk Charts.

The trifecta | Everything is in sync, and the chart has no weaknesses.

The singles

Type Q

Some charts use a good source of data effectively presented in a visual display. However, the effort fails because of a poorly defined objective, or an unengaging premise.

Type D

Some designs emerge from well-posed and interesting questions, and the graph is well executed. The problem here is the data, which fail to illuminate the question. Typically, the data only tangentially concern the topic, or certain adjustments are wanting, or there is a quality concern.

Type V

Despite having a good data source and an interesting, well-posed problem, the visual design hides or confuses the message. These charts have long provided fodder for Tufte, Wainer and the like.

The Doubles

Type QD

The graphical elements follow best practices, and present the data well. This effort is in vain, because of poor data quality, and an unclear objective.

Type QV

The data has been properly collected and processed. However, the question being addressed has not been clearly defined, and the graphical design fails to bring out the key features of the data.

Type DV

An interesting question has been posed. The data fail to convince, and the cause is not helped by poor execution of the graphical elements.

Triple

These graphical disasters do not get anything right.

Type QDV

These graphical disasters do not get anything right.

Data Visualizations

Data visualization is the graphical representation of data. But why should data analysts care about data visualization?

Your audience won’t always have the ability to interpret or understand the complex information that you relay to them, so your job is to inform them of your analysis in a way that is meaningful, engaging, and easy to understand. Part of why data visualization is so effective is that people’s eyes are drawn to colors, shapes, and patterns, which makes those visual elements perfect for telling a story that goes beyond just the numbers.

Four elements of successful data visualizations:

  • Information - Reflects the conclusion you've drawn from the data, which you communicate via the visualization.

  • Story - Adds meaning to the data and makes it interesting.

  • Goals - Makes the data usable and useful.

  • Visual form - creates both beauty and structure.

Source: What Makes a Good visualization.

Best Practices

The audience should know what they are looking at within 5 seconds, and the visualization should be clear and easy to follow.

In the next 5 seconds, the audience should understand the conclusion of the visualization, even if they are unfamiliar with the data.

As long as the visualization is not misleading, only show the data that the audience needs to understand the findings.

Effective Data Visualizations

Pre-attentive attributes: marks and channels

Pre-attentive attributes are the elements of a data visualization that people recognize automatically without conscious effort. The essential, basic building blocks that make visuals immediately understandable are called marks and channels.

Marks

M​arks are basic visual objects like points, lines, and shapes. Every mark can be broken down into four qualities:

Position - Where a specific mark is in space relating to a scale or to other marks.

Size - How big, small, long, or tall a mark is.

Shape - Whether a specific object is given a shape that communicates something about it.

Color - What color the mark is.

Channels

C​hannels are visual aspects or variables that represent characteristics of the data. Channels are basically marks that have been used to visualize data. Channels will vary in terms of how effective they are at communicating data based on three elements:

Accuracy - Are the channels helpful in accurately estimating the values being represented?

Pop out - How easy is it to distinguish certain values from others?

Grouping - How good is a channel at communicating groups that exist in the data?

Remember: the more you emphasize different things, the less that emphasis counts. The more you emphasize one single thing, the more that counts.

Design principles

Choose the appropriate visual

One of the first things you have to decide is which visual will be the most effective for your audience. Sometimes, a simple table is the best visualization. Other times, you need a more complex visualization to illustrate your point.

Optimize the data-ink ratio

The data-ink ratio entails focusing the ink on the parts of the visual that are essential to understanding the point of the chart. Try to minimize non-data ink, like boxes around legends or shadows, to optimize the data-ink ratio.

Use orientation effectively

Make sure the written components of the visual, like the labels on a bar chart, are easy to read. You can change the orientation of your visual to make it easier to read and understand.

Color

There are several important considerations when using color in your visuals. These include using color consciously and meaningfully, staying consistent throughout your visuals, being considerate of what colors mean to different people, and using inclusive color scales that make sense for everyone viewing them.

Numbers of things

Think about how many elements you include in any visual. If your visualization uses lines, try to plot five or fewer. If that isn’t possible, use color or hue to emphasize important lines. Also, when using visuals like pie charts, try to keep the number of segments to less than seven since too many elements can be distracting.

Engage your audience

Data visualization can make complex (and even monotonous) information easily understood, and knowing how to utilize data visualization is a valuable skill to have. Your goal is always to help the audience have a conversation with the data so your visuals draw them into the conversation. This is especially true when you have to help your audience engage with a large amount of data, such as the flow of goods from one country to other parts of the world.

Reference - Data Visualization

The Data Visualization Catalog - Dated Information with last post from 2014.

A project developed by Severino Ribecca to create a (non-code-based) library of different information visualization types.

The 25 Best Data Visualizations of 2020

The 10 Best Data Visualization Blogs To Follow - Tableau

Data Studio Report Gallery - Google Data Studio

IBM Blogs - IBM

Oracle Blogs - Oracle

Data Warehouse Specialist

Develop the processes and procedures to effectively store and organize data.

Wide Data

Every data subject has a single row with multiple columns to hold values of various attributes of the subject.

Wide data is data where each row contains multiple data points for the particular items identified in the columns.

Database

A collection of data stored in a computer system.

Dataset

A collection of data that can be manipulated or analyzed as one unit.

Design Thinking

Design thinking is a non-linear, iterative process that teams use to understand users, challenge assumptions, redefine problems and create innovative solutions to prototype and test. Involving five phases—Empathize, Define, Ideate, Prototype and Test—it is most useful to tackle problems that are ill-defined or unknown.

Interaction Design Foundation

Dictionary

A collection of words on a particular subject, arranged alphabetically along with their meanings.

Dirty Data

Data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve and can't be used in a meaningful way.

Duplicate Data

Any data record that shows up more than once.

Outdated Data

Any old data, which should be replaced with newer or more accurate data.

Incomplete Data

Any data that is missing important fields.

Incorrect / Inaccurate Data

Any data that is complete but inaccurate.

Inconsistent Data

Any data that uses different formats to represent the same thing.

Discrete data

Data that is counted and has a limited number of values.

Documentation

The process of tracking changes, additions, deletions and errors involved in your data cleaning effort.

E

Equation

A calculation that involves addition, subtraction, multiplication, or division; also known as a math expression.

Estimated Response Rate (See Sample Size)

If you are running a survey of individuals, this is the percentage of people you expect will complete the survey out of those that received it.

Ethics

Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of right, obligations, benefits to society, fairness, or specific values.

Experimenter Bias

The tendency for different people to observe things differently (see also Observer Bias).

External Data

Data that lives and is generated outside an organization.

F

Fairness

A quality of data analysis that does not create or reinforce bias.

Field

A single piece of information from a row or column of a spreadsheet; in a data table, typically a column in the table.

Fill handle

A box in the lower right-hand corner of a selected spreadsheet cell that can be dragged through neighboring cells to continue an instruction.

Filtering

The process of showing only the data that meets specified criteria while hiding the rest.

Find & Replace

A tool that looks for a specified search term in a spreadsheet and allows you to replace it with something else.

First party data

Data collected by an individual or group using their own resources.

Float

A number that contains a decimal point.

Foreign key

A field within a database table that is a primary key in another table (see also Primary Key).

Formula

A set of instructions used to perform a calculation using the data in a spreadsheet.

FROM

The section of a query that indicates where the selected data comes from.

Functions

A preset command that automatically performs a specified process or task using the data in a spreadsheet.

A set of instructions that performs a specific calculation using the data in a spreadsheet.

DATEDIF - Google

Calculates the number of days, months, or years between two dates.

DAYS360 - Excel

The DAYS360 function returns the number of days between two dates based on a 360-day year (twelve 30-day months).

VLOOKUP - Excel

VLOOKUP - Google

FILTER

LEN - Function

A function that tells you the length of a text string by counting the number of characters it contains.

LEFT

A function that gives you a set number of characters from the left side of a string.

Both LEFT and RIGHT count characters from the left or right side of the string. So if a value in a cell is a combination of numbers and letters with a fixed length (for example, 12345AVBC), you can create a new column and copy a fixed number of characters from either end into it.

MID

A function that gives you a segment from the middle of a string, i.e., it pulls a substring from the middle of a text string.

RIGHT

A function that gives you a set number of characters from the right side of a string.

Both LEFT and RIGHT count characters from the left or right side of the string. So if a value in a cell is a combination of numbers and letters with a fixed length (for example, 12345AVBC), you can create a new column and copy a fixed number of characters from either end into it.
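
For comparison only, the same fixed-position extraction can be sketched with plain Python string slicing; this is a rough analogue of LEFT, MID, and RIGHT, not the spreadsheet functions themselves, using the example value above.

value = "12345AVBC"

left_part = value[:5]      # like LEFT(value, 5)   -> "12345"
mid_part = value[5:7]      # like MID(value, 6, 2) -> "AV"
right_part = value[-4:]    # like RIGHT(value, 4)  -> "AVBC"

print(left_part, mid_part, right_part)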

TRIM

A function that removes leading, trailing, and repeated spaces in data.

VLOOKUP

A function that searches for a certain value in a column to return a corresponding piece of information.

G

Gap analysis

A method for examining and evaluating the current state of a process to identify opportunities for improvement in the future.

General Data Protection Regulation of the European Union (GDPR)

A regulation enacted by the European Union to protect individuals and their personal data.

Geolocation

The geographical location of a person or device using digital information.

Glossary

Collection of terms with their definitions.

Good Data Source

A data source that is reliable, original, comprehensive, current and cited.

Reference

Google Sites

Keyboard Shortcuts - Google

Keyboard Shortcuts - Microsoft

H

Header

The first row in a spreadsheet that labels the type of data in each column.

Hypothesis Testing

A way to see if a survey or experiment has meaningful results.

I

IMPORTDATA

Imports data at a given url in .csv (comma-separated value) or .tsv (tab-separated value) format.

IMPORTFEED

Imports a RSS or ATOM feed.

IMPORTHTML

Imports data from a table or list within an HTML page.

IMPORTRANGE

Imports a range of cells from a specified spreadsheet.

IMPORTXML

Imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.

Internal Data

Data that lives within a company's own systems.

Interpretation Bias

The tendency to interpret ambiguous situations in a positive or negative way.

L

LENGTH() / LEN()

Returns the length of a string of text by counting the number of characters it contains.

Lexicon

An arrangement of words and their definitions, organized alphabetically.

M

Margin of Error

The margin of error is a statistic expressing the amount of random sampling error in the results of a survey. The larger the margin of error, the less confidence one should have that a poll result would reflect the result of a survey of the entire population. The margin of error will be positive whenever a population is incompletely sampled and the outcome measure has positive variance, which is to say, the measure varies.

  • The maximum amount that the sample results are expected to differ from those of the actual population. More technically, the margin of error defines a range of values below and above the average result for the sample. The average result for the entire population is expected to be within that range. We can better understand margin of error by using the example below.

  • To calculate the margin of error, you need the population, sample size, and confidence level.

  • Confidence level: A percentage indicating how likely your sample accurately reflects the greater population

  • Population: The total number you pull your sample from

  • Sample: A part of a population that is representative of the population

  • Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population

In most cases, a 90% or 95% confidence level is used. But, depending on your industry, you might want to set a stricter confidence level. A 99% confidence level is reasonable in some industries, such as the pharmaceutical industry.
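
As a worked illustration (all numbers invented), the margin of error for a survey proportion can be approximated in Python: the z-score for the chosen confidence level times the standard error of the proportion.

from statistics import NormalDist
from math import sqrt

n = 1000            # sample size (invented)
p = 0.52            # observed sample proportion (invented)
confidence = 0.95

# z-score for the chosen confidence level (about 1.96 for 95%).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin_of_error = z * sqrt(p * (1 - p) / n)
print(f"Margin of error: +/- {margin_of_error:.1%}")   # roughly +/- 3.1%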

Margin of Error in Marketing

A/B testing (or split testing) tests two variations of the same web page to determine which page is more successful in attracting user traffic and generating revenue. User traffic that gets monetized is known as the conversion rate.

A/B testing allows marketers to test emails, ads, and landing pages to find the data behind what is working and what isn’t working. Marketers use the confidence interval (determined by the conversion rate and the margin of error) to understand the results.

Reference - Sample Size

Good Calculators Sample Size Margin of Error

CheckMarket Sample Size Margin of Error

The McCandless Method of Data Presentation

  • Introduce the graphic by its name

  • Answer the obvious before being asked

  • State the insight of your graphic

  • Call out data to support the insight

  • Close and transition to next point

https://artscience.blog/home/the-mccandless-method-of-data-presentation

Reference - Data Viz

Information is Beautiful (Main Site)

Founded by David McCandless, author of two bestselling infographics books, Information is Beautiful is dedicated to helping you make clearer, more informed decisions about the world. All our visualizations are based on facts and data: constantly updated, revised and revisioned.

Information is Beautiful - What Makes a Good Visualization (presentation)

Definition of a good viz.

Online Seminars

The beauty of data visualization (TED)

David McCandless turns complex data sets (like worldwide military spending, media buzz, Facebook status updates) into beautiful, simple diagrams that tease out unseen patterns and connections. Good design, he suggests, is the best way to navigate information glut -- and it may just change the way we see the world.

McCandless Method of Data Visualization (Blog)

A DATA (AND OTHER THINGS) BLOG

News Daily & Examples

In this McCandless collection, explore uplifting trends and statistics that are beautifully visualized for your creative enjoyment. A new chart is released every day so be sure to visit often to absorb the amazing things happening all over the world.

The Wall Street Journal Guide to Information Graphics: The Dos and Don'ts of Presenting Data, Facts, and Figures (Link to Amazon book - 2010)

Stanley McCandless

https://en.wikipedia.org/wiki/Stanley_McCandless

Merger

An agreement that unites two organizations into a single one.

Descriptive Metadata

Metadata that describes a piece of data and can be used to identify it at a later time.

N

Null

An indication that a value does not exist in a dataset.

P

Digital photo

An electronic or computer-based image, usually in BMP or JPEG format.

Pivot Table

A data summarization tool used in data processing and data cleaning.

Population

All possible data values in a certain dataset.

The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.

Primary Key

References a column in which each value is unique.

Q

R

Relational Database

A database that contains a series of tables that can be connected to form relationships.

Remove Duplicates

A tool used when cleaning data, and it automatically searches for and eliminates duplicate entries from a spreadsheet.

S

Sample

A part of the population that is representative of the population.

A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.

Don’t use a sample size less than 30. It has been statistically proven that 30 is the smallest sample size where an average result of a sample starts to represent the average result of a population.

The confidence level most commonly used is 95%, but 90% can work in some cases.

Bias - Sampling

A sample isn't representative of the population as a whole.

In probability theory, the central limit theorem (CLT) states that the distribution of a sample variable approximates a normal distribution (i.e., a “bell curve”) as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population's actual distribution shape.

See the Central Limit Theorem reference below for more info on: normal distribution, variance, law of large numbers.

The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population's distribution.

Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.

A key aspect of CLT is that the average of the sample means and standard deviations will equal the population mean and standard deviation.

A sufficiently large sample size can predict the characteristics of a population more accurately.
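
A small simulation (with assumed, illustrative numbers) shows the CLT in action: individual values come from a skewed distribution, yet the means of repeated samples cluster around the population mean in a roughly normal, bell-shaped way.

import numpy as np

rng = np.random.default_rng(42)

# A skewed, non-normal population: exponential with mean 10.
population = rng.exponential(scale=10, size=100_000)

# Take 2,000 samples of size 30 and record each sample's mean.
sample_means = [rng.choice(population, size=30).mean() for _ in range(2_000)]

print(f"population mean:        {population.mean():.2f}")
print(f"mean of sample means:   {np.mean(sample_means):.2f}")  # close to the population mean
print(f"spread of sample means: {np.std(sample_means):.2f}")   # far narrower than the population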

Reference

Central Limit Theorem - Investopedia

Sample Size Formula - Complete Dissertation

Confidence Interval - Sample

The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.

Confidence Level - Sample

How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study.

The probability that your sample size accurately reflects the greater population.

Increase the sample size to meet specific needs of your project:

  • For a higher confidence level, use a larger sample size.

  • To decrease the margin of error, use a larger sample size.

  • For greater statistical significance, use a larger sample size.
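
Tying these terms together: the confidence interval is simply the sample result plus or minus the margin of error. A minimal sketch, reusing the invented survey numbers from the margin of error entry:

from statistics import NormalDist
from math import sqrt

n, p, confidence = 1000, 0.52, 0.95   # invented survey numbers

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin_of_error = z * sqrt(p * (1 - p) / n)

low, high = p - margin_of_error, p + margin_of_error
print(f"{confidence:.0%} confidence interval: {low:.1%} to {high:.1%}")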

Sample Size

A part of the population that is representative of the population.

Estimated Response Rate

If you are running a survey of individuals, this is the percentage of people you expect will complete the survey out of those who received the survey.

Confidence Level

The probability that your sample accurately reflects the greater population.
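
A common way to estimate a minimum sample size combines these inputs: the z-score for the confidence level, an assumed proportion, the target margin of error, and (optionally) the estimated response rate. A minimal sketch using Cochran's formula with a finite-population correction; all inputs are invented.

from statistics import NormalDist
from math import ceil

confidence = 0.95
margin_of_error = 0.05   # target margin of error (invented)
p = 0.5                  # assumed proportion; 0.5 gives the most conservative (largest) size
population = 5000        # total population being sampled (invented)
response_rate = 0.10     # estimated response rate (invented)

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Cochran's formula for a very large population, then a finite-population correction.
n_infinite = (z ** 2) * p * (1 - p) / margin_of_error ** 2
n_adjusted = n_infinite / (1 + (n_infinite - 1) / population)

print("minimum sample size:", ceil(n_adjusted))                  # roughly 357 for these inputs
print("surveys to send out:", ceil(n_adjusted / response_rate))  # adjusted for the response rate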

Margin of Error - Sample

Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.

Population - Sample

Entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.

The total number you hope to pull your sample from.

Random Sampling

A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen.

Schema

A way of describing how something is organized.

SELECT

SELECT FROM

Pull data from any table in a database.

SELECT FROM WHERE

Pull data from a table, filtered to only the rows that meet specified criteria.

Sorting

Arranging data into a meaningful order to make it easier to understand, analyze and visualize.

Split

A tool that divides text around a specified character and puts each fragment in a new, different cell.

Spreadsheets

Generated with a program.

Access to the data you input.

Stored locally.

Small datasets.

Working independently.

Built-in functionalities.

Statistical Power

The probability of getting meaningful results from a test.

The statistical power of a study (sometimes called sensitivity) is how likely the study is to distinguish an actual effect from one of chance. It’s the likelihood that the test is correctly rejecting the null hypothesis (i.e. “proving” your hypothesis). For example, a study that has an 80% power means that the study has an 80% chance of the test having significant results.

Usually, you need a statistical power of at least 0.8 (80%) to consider your results statistically significant.

A high statistical power means that the test results are likely valid. As the power increases, the probability of making a Type II error decreases.

A low statistical power means that the test results are questionable.

Statistical power can be calculated and reported for a completed experiment to comment on the confidence one might have in the conclusions drawn from the results of the study. It can also be used as a tool to estimate the number of observations or sample size required to detect an effect in an experiment.
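
In the spirit of the Python reference below, a power analysis can be sketched with the statsmodels package (this assumes statsmodels is installed; the effect size and numbers are invented): given any three of effect size, sample size, significance level, and power, it solves for the fourth.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Observations per group needed to detect a medium effect (d = 0.5)
# at a 5% significance level with 80% power.
sample_size = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"observations per group: {sample_size:.0f}")   # roughly 64

# Conversely, the power achieved by a study with 30 observations per group.
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
print(f"achieved power: {power:.2f}")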

Reference

A Gentle Introduction to Statistical Power and Power Analysis in Python

A Gentle Introduction to Statistical Hypothesis Testing

Statistics - Basic Probability Theorem

Probability theory is the mathematical framework that allows us to analyze chance events in a logically sound manner. The probability of an event is a number indicating how likely that event will occur. This number is always between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
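
A tiny illustration of the 0-to-1 scale: the exact probability of rolling a six with a fair die, next to an estimate from a simulation (the code is purely illustrative).

import random

# Exact probability of rolling a six with a fair die: 1/6, about 0.167.
exact = 1 / 6

# Estimate the same probability by simulating many rolls.
rolls = 100_000
sixes = sum(1 for _ in range(rolls) if random.randint(1, 6) == 6)
estimate = sixes / rolls

print(f"exact: {exact:.3f}, simulated: {estimate:.3f}")  # both between 0 and 1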

Reference

Seeing Theory - Brown

Statistical Significance - Sample

The determination of whether your result could be due to chance or not. The greater the significance, the less due to chance.

If a test is statistically significant, it means the results of the test are real and not an error caused by chance.

Statistical Type II Error

A type II error is a statistical term used within the context of hypothesis testing that describes the error that occurs when one fails to reject a null hypothesis that is actually false. A type II error produces a false negative, also known as an error of omission.
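
A brief SciPy sketch (with invented data) ties these terms together: a two-sample t-test returns a p-value; a small p-value is read as statistical significance, while failing to detect a difference that really exists would be a Type II error.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two invented groups whose true means genuinely differ (50 vs. 53).
group_a = rng.normal(loc=50, scale=10, size=40)
group_b = rng.normal(loc=53, scale=10, size=40)

result = stats.ttest_ind(group_a, group_b)
print(f"p-value: {result.pvalue:.3f}")

# Common convention: p < 0.05 is treated as statistically significant.
# If the p-value came out above 0.05 here, we would have missed a real
# difference; that would be a Type II error (a false negative).
print("significant at 5%:", result.pvalue < 0.05)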

Structured Query Language (SQL)

A language to interact with database programs.

Can pull info from different sources in a database.

Stored across a database.

Larger datasets.

Track changes across a team.

Useful across multiple programs.

Sub-string

A subset of a text string.

SUBSTR()

Returns a limited number of characters, used to create substrings from longer strings of text.

Syntax

A predetermined structure that includes all required information and its proper placement.

T

Text String

A group of characters within a cell, commonly composed of letters, numbers or both.

TRIM

A function, often used when cleaning data, that removes leading, trailing, and repeated spaces in data.

U

V

Verification

A process to confirm that a data-cleaning effort was well executed, and the resulting data is accurate and reliable.

VLOOKUP

Vertical lookup. Searches down the first column of a range for a key and returns the value of a specified cell in the row found.

A function that searches for a certain value in a column to return a corresponding piece of information.

=VLOOKUP(data to lookup,'where to look'!Range, column, false)
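
A rough pandas analogue of the same lookup (not the spreadsheet function itself; the tables and column names are invented): map a key column in one table to values from another.

import pandas as pd

# Lookup table: part numbers and their prices (invented data).
prices = pd.DataFrame({"part": ["A100", "B200", "C300"], "price": [9.99, 4.50, 12.00]})

# Orders that need a price filled in by looking up the part number.
orders = pd.DataFrame({"part": ["B200", "A100", "B200"]})

# Like VLOOKUP with an exact match (the 'false' argument above).
orders["price"] = orders["part"].map(prices.set_index("part")["price"])
print(orders)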

Reference

VLOOKUP - Google

W

X

Y

Z

Reference

Automation Analysis

Automating Scientific Data Analysis Part 1 - towardsdatascience

Automating big-data analysis - MIT News

10 of the Best Options for Workflow Automation Software - Technology Advice

Causation & Correlation

Towards Data Science

Correlation and Causation | Lesson - Khan Academy

Cleaning Data

Guide To Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data - Tableau

Top ten ways to clean your data - Microsoft

10 Google Workspace tips to clean up data - Google

Data - Open & Public

Is There a Difference Between Open Data and Public Data?

Kaggle open datasets

https://www.kaggle.com/datasets

https://www.kaggle.com/sakshigoyal7/credit-card-customers

https://www.kaggle.com/datasnaek/youtube-new

https://www.kaggle.com/rtatman/188-million-us-wildfires

https://www.kaggle.com/bigquery/google-analytics-sample

https://www.kaggle.com/docs/datasets

Data Visualization

The 25 Best Data Visualizations of 2020

The 10 Best Data Visualization Blogs To Follow - Tableau

Data Studio Report Gallery - Google Data Studio

General List of Help Sites

Welcome to the Google Workspace Learning Center - Google

10 Google Workspace tips for finance - Google

Margin of Error

Margin of Error Calculator 1

Margin of Error Calculator 2

Margin of Error Calculator 3

Margin of Error Calculator 4 (Sheet)

McCandless Method

Information is Beautiful (Main Site)

Founded by David McCandless, author of two bestselling infographics books, Information is Beautiful is dedicated to helping you make clearer, more informed decisions about the world. All our visualizations are based on facts and data: constantly updated, revised and revisioned.

Information is Beautiful - What Makes a Good Visualization (presentation)

Definition of a good viz.

Online Seminars

The beauty of data visualization (TED)

David McCandless turns complex data sets (like worldwide military spending, media buzz, Facebook status updates) into beautiful, simple diagrams that tease out unseen patterns and connections. Good design, he suggests, is the best way to navigate information glut -- and it may just change the way we see the world.

McCandless Method of Data Visualization (Blog)

A DATA (AND OTHER THINGS) BLOG

News Daily & Examples

In this McCandless collection, explore uplifting trends and statistics that are beautifully visualized for your creative enjoyment. A new chart is released every day so be sure to visit often to absorb the amazing things happening all over the world.

The Wall Street Journal Guide to Information Graphics: The Dos and Don'ts of Presenting Data, Facts, and Figures (Link to Amazon book - 2010)

Sample Size

Sample Size Calculator 1

Sample Size Calculator 2

Sample Size Calculator 3 (Sheet)

Sample Size Calculator 4

A Gentle Introduction to Statistical Power and Power Analysis in Python

VLOOKUP

VLOOKUP - Google

VLOOKUP function - Microsoft