Introduction
A histogram is one of the seven basic tools of quality control. Histograms also figure prominently in the data visualization world. For a small data set, a histogram is easy to draw by hand. We can also use a tool like MS Excel to plot histograms. However, we are going to plot it the cool way: using Python. The module we will be using is called matplotlib. It is currently the most popular module for data visualization, but it is not the only one out there. There are more modern modules like altair and seaborn. But as it stands today, matplotlib remains the popular choice.
In this post, we will learn to plot a histogram using a Python dictionary.
Objectives
- Gain a brief understanding of data visualization and techniques.
- Learn about histograms, how to read them, and how to create them.
- Understand Python dictionaries, defaultdict and Counter().
- Learn to create a typical bar chart using the Python matplotlib module.
- Learn to create a histogram using the Python matplotlib module.
What is data visualization?
Data visualization is the graphical representation of information and data. It has many tools including but not limited to charts, graphs, and maps. Data visualization has gained wide prominence in the last decade due to the rise of big data. The easiest way to understand trends, outliers, and patterns in data is to graphically represent them. In a world that is increasingly making data-driven decisions, data visualization methods have proved to be very effective. One can say data visualization has mainly 2 uses.
- To explore/understand the given data
- To communicate the given data
What are some data visualization techniques?
Discussing data visualization or its techniques in full would require an entire book. Instead, we will go through a short list of visualization techniques. These are easy to learn, and most of us probably came across them in school.
- Area chart
- Bar chart
- Heat map
- Histogram
- Pie chart
- Scatterplot
Data visualization is a fast-growing field. The techniques are always evolving. The above list is very crude but is a starting point if one is looking to learn data visualization. In this article, we will be specifically looking to discuss histograms.
What is a histogram?
A histogram is used to graphically represent the distribution of numerical data. In other words, it is used to visualize how frequently values occur within given ranges. One must have heard of (or at least seen) color histograms. A color histogram is a representation of the distribution of colors in an image. Anyone who has ever tried photo editing must have come across one, at least unknowingly.
Histograms were invented by the English mathematician and biostatistician, Karl Pearson. He is also by and large credited with establishing the field of mathematical statistics. The general step-by-step method to plot histograms is as follows:
- Divide the given range of values into defined intervals. These intervals are called bins or buckets.
- Count how many values fall into each bin.
- The range of values is marked across the x-axis.
- The counts are marked across the y-axis.
- A bar graph is plotted.
Things to note:
- The intervals should be non-overlapping. The entire purpose of the histogram is to visualize the frequency of data in each interval, so every value must fall into exactly one of them.
- The intervals are usually of equal width, but they do not have to be.
- The intervals chosen should be adjacent.
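The steps and notes above can be sketched in plain Python before any plotting library is involved. This is only an illustration: the sample values and the bin edges are made up, and the standard library function bisect_right is used to find the bin that each value belongs to.

```python
from bisect import bisect_right

# Hypothetical sample values and bin edges, chosen only for illustration.
# Assumption: every value lies between the first and the last edge.
values = [3, 7, 12, 18, 21, 25, 29, 34]
edges = [0, 10, 20, 30, 40]  # adjacent bins: [0, 10), [10, 20), ...

# Each value belongs to exactly one bin because neighbouring bins share
# an edge, and the left edge is inclusive while the right edge is not.
counts = [0] * (len(edges) - 1)
for value in values:
    index = bisect_right(edges, value) - 1
    counts[index] += 1

# One text "bar" per bin, in x-axis order.
for i, count in enumerate(counts):
    print(f"[{edges[i]:>2}, {edges[i + 1]:>2}): {'#' * count}")
```

Because the bins are described by a list of edges, the same sketch also works when the intervals are not of equal width.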
PS: The difference between a histogram and a bar chart is that a histogram represents continuous data. In a bar chart, each bar is dedicated to a special category. We usually give space between the bars in a bar chart which is absent in a histogram. The space between the bars should immediately communicate whether we are looking at a bar chart or a histogram.
Reading a histogram
Since histograms are visual tools, reading them is easy, as it is supposed to be. We already mentioned that histograms usually have equal intervals but it is not necessary. There are two ways of reading a histogram depending on the interval selection.
If the histogram has equal intervals, then each bar can be directly compared with the other.
- The height of the bar is directly proportional to the frequency of the data point.
- The y-axis represents the frequency of the data.
- The x-axis represents the range.
If the histogram has unequal intervals, then each bar cannot be directly compared with the other.
- In this case, the area of the bar is proportional to the frequency of the data. Because the widths differ from bar to bar, the height of each bar is its frequency divided by the width of its interval.
- The y-axis represents frequency density as opposed to the frequency in the earlier case. This is because the y-axis represents the number of data points per unit of variable in the x-axis.
- The x-axis represents the range.
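A small worked example may make frequency density more concrete. The bins and counts below are hypothetical; the only rule taken from the text is that the bar height equals the count divided by the interval width.

```python
# Hypothetical bins with unequal widths: (start, end, count).
bins = [(0, 10, 4), (10, 20, 6), (20, 50, 6)]

for start, end, count in bins:
    width = end - start
    density = count / width  # bar height when the widths differ
    print(f"{start}-{end}: count={count}, width={width}, density={density:.2f}")

# Heights of the three bars, in x-axis order.
densities = [count / (end - start) for start, end, count in bins]
```

The second and third bins hold the same number of values, but the third bar is drawn lower because its interval is three times as wide. Multiplying each height by its width gives back the count, which is why the area is the quantity to compare.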
Histograms have 5 patterns that we should be looking out for.
- Unimodal: Here the distribution has 1 peak.
- Skewed right: Here the frequency peaks on the left side and tails off to the right.
- Skewed left: Here the frequency is low on the left and peaks on the right.
- Bimodal: Here the distribution has 2 peaks.
- Multimodal: A multimodal distribution has 3 or more peaks.
We should also try to change the interval period to see different patterns arise. Plotting over different intervals can give valuable information.
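To see how the choice of interval changes the picture, we can count the same values with two different bin widths. This is a minimal sketch with made-up values; bin_counts is a hypothetical helper name, and Counter comes from the standard library.

```python
from collections import Counter

# Hypothetical sample values, used only for illustration.
values = [2, 4, 11, 13, 15, 18, 22, 27, 31, 38]


def bin_counts(data, width):
    """Count how many values fall in each bin of the given width."""
    counts = Counter((value // width) * width for value in data)
    return dict(sorted(counts.items()))


narrow = bin_counts(values, 10)
wide = bin_counts(values, 20)
print(narrow)
print(wide)
```

With a width of 10 the peak in the 10 to 19 range stands out, while with a width of 20 the same data collapses into two bars and the peak is no longer visible.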
Applications of histogram
Histograms have many applications. We have listed a few.
- Histograms are used in digital image processing to show the distribution of image contrast or brightness. Histogram tools allow us to adjust them in popular image editors, including Photoshop.
- Histograms are used in hydrology to analyze rainfall frequency.
- We can check the distribution type of our data by constructing a histogram.
- Histograms are useful tools for spotting deviations in our data.
Constructing a histogram
The best way to understand something is to make it. But before we show how to make one using Python, let us understand how it is made by hand. Let’s suppose we have 10 students who have recently taken a test. The students have scored a range of marks. We want to understand the distribution of marks of these students so we can have a better understanding of how much the students are scoring. The marks are as follows:
marks = [6, 95, 82, 82, 73, 8, 75, 82, 99, 67]
The histogram will look like this. We made it using Matplotlib, which we will get to later.
We divided the full range of marks into intervals of 10. Then we counted how many students scored within each interval. The number of students is marked on the y-axis and determines the height of the bar. The range itself is marked on the x-axis. By looking at the histogram, we can easily see that the largest group (3 out of 10 students) scored between 80 and 90 marks. We can also see that 80% of students scored above 60. See how a simple histogram made data visualization possible.
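The two readings above can be checked by hand or with a few lines of Python. This is only a sketch of the manual count; it assumes no mark of 100, which holds for this sample.

```python
marks = [6, 95, 82, 82, 73, 8, 75, 82, 99, 67]

# One counter per interval of 10: index 0 is 0-9, index 9 is 90-99.
# Assumption: no mark is 100, otherwise the index would be out of range.
counts = [0] * 10
for mark in marks:
    counts[mark // 10] += 1

# Text version of the histogram, one row per interval.
for i, count in enumerate(counts):
    print(f"{i * 10:>2}-{i * 10 + 9:>2}: {'#' * count}")

above_60 = sum(1 for mark in marks if mark > 60)
print(f"{above_60} of {len(marks)} students scored above 60")
```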
Python dictionaries
Dictionaries are one of the fundamental data structures in Python. A dictionary stores key-value pairs. For us to plot a histogram, it is necessary to split our data into key-value pairs. By looking at the above graph, we can see that our keys will be the ranges on the x-axis while our values are the counts on the y-axis. In Python, dictionaries are the standard way to store key-value pairs. So it is important to understand Python dictionaries before going forward.
There are a few ways to declare a dictionary, but the most common is the literal syntax, often called the pythonic way.
our_dict = {}
The above syntax declares an empty dictionary called our_dict. There are a few more ways to declare a dictionary.
# creating an instance of the class dictionary
our_dict = dict()
# creating a dictionary literal
marks = {"Alan": 92, "Turing": 88}
But we will prefer the pythonic way of doing things. Now let’s understand how to add keys and values to a dictionary.
marks = {}
marks["Alan"] = 92
marks["Turing"] = 88
print(marks)
# {'Alan': 92, 'Turing': 88}
Here we manually inserted the key-value pairs into the dictionary. However, that is not practical while making a histogram. We need to be able to automatically count the values in each interval (the keys). To understand that, let’s first write a program that counts the frequency of words in a document. This is naturally done with a dictionary where the word is the key and the count is the value. In this case, we will have to write a program that checks each word in the document and does one of two things:
- If the word is not already in the dictionary, add the word as a key and value as 1.
- If the word is already in the dictionary, just add 1 to the existing value.
This can be achieved using a simple for-loop.
# document is assumed to be a list of words
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
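To try the counting loop, we need an actual document. The sketch below assumes the document is a plain string that we lowercase and split into words (the sample sentence is made up). It also uses dict.get, a built-in dictionary method that returns a fallback value when the key is missing, as a shorter way of writing the same two cases.

```python
# Assumption: the document is a plain string, split into lowercase words.
text = "the quick brown fox jumps over the lazy dog the end"
document = text.lower().split()

word_counts = {}
for word in document:
    # get() returns 0 for a word we have not seen yet.
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts["the"])
print(len(word_counts))
```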
While the above code does the job, Python has a better way of doing it using defaultdict.
defaultdict
defaultdict is just like a regular Python dictionary, except that if we look up a key that is not present, it automatically adds that key to the dictionary. The corresponding value is created by calling the argument we passed to defaultdict. Let’s rewrite the above program using defaultdict.
from collections import defaultdict
word_counts = defaultdict(int)
for word in document:
    word_counts[word] += 1
Here we declared word_counts as a defaultdict with the argument int. That means a missing key gets the value returned by int(), which is 0, so adding 1 during the for loop gives a count of 1 the first time a word is seen. If the word was already in the dictionary, the loop simply adds 1 to the existing value. We can also pass list or dict as the argument to defaultdict. The default value of a missing key will then be an empty list or an empty dictionary accordingly.
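As a quick illustration of passing list instead of int, the sketch below groups marks by student. The names and marks are hypothetical and only serve to show the behaviour of a missing key.

```python
from collections import defaultdict

# Hypothetical (student, mark) pairs, used only for illustration.
results = [("Alan", 92), ("Turing", 88), ("Alan", 75), ("Turing", 64)]

marks_by_student = defaultdict(list)
for student, mark in results:
    # A missing key starts out as an empty list, so append always works.
    marks_by_student[student].append(mark)

print(dict(marks_by_student))
```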
Counter()
Counter() turns a sequence of values into a dictionary-like object that maps keys to counts. In other words, Counter() gives us an object that behaves much like a defaultdict with int as its argument. Let’s go through a simple example to understand Counter().
from collections import Counter
list1 = [1, 2, 1, 2, 3, 4, 5, 2, 3, 4, 5, 5, 1, 2]
counts = Counter(list1)
print(counts)
# Counter({2: 4, 1: 3, 5: 3, 3: 2, 4: 2})
As we can see, Counter() counts each item in the list. Each distinct item becomes a key, and its value is the total count of that item. Instead of list1, we can pass Counter() the interval that each value belongs to, and it will count the total number of values in each interval.
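One way to hand intervals to Counter() is to map every value to the start of its interval first. The sketch below does this for a made-up list of ages and 10-year intervals; the data and variable names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical ages, used only for illustration.
ages = [3, 14, 17, 21, 25, 29, 33, 41, 45, 48]

# Map each age to the start of its 10-year interval before counting.
decade_counts = Counter(age // 10 * 10 for age in ages)

for decade in sorted(decade_counts):
    print(f"{decade}-{decade + 9}: {decade_counts[decade]}")
```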
Creating a bar graph using python (matplotlib)
A histogram is made from a bar graph except the values are continuous for a histogram. In a bar graph, they’re discrete. Before we learn to create a histogram, let’s first learn to create a typical bar graph. This will help ease the process of understanding the creation of a histogram.
Creating a bar graph using matplotlib is pretty simple. Here we are making a bar graph to represent how many trophies the top 5 students got during their final year. The students are Ardelia Connery, Chester Clare, Luciano Bartle, Kacey Felker, and Tosha Tuel (yes, we used a random name generator). The corresponding numbers of trophies they received in the final year are 3, 2, 8, 7, and 3.
Now let’s learn to represent them in a bar graph step-by-step. One can find the complete code below.
STEP 1: Import the pyplot module from matplotlib
from matplotlib import pyplot as plt
STEP 2: Making a list of names and trophy count
students = ["Ardelia Connery", "Chester Clare", "Luciano Bartle", "Kacey Felker", "Tosha Tuel"]
trophies = [3, 2, 8, 7, 3]
STEP 3: Passing the values to the plt.bar function
plt.bar(range(len(students)), trophies)
The syntax of plt.bar is plt.bar(x, height). Here x is range(len(students)) and height is the list trophies.
The len(students) call returns the length of the list. The range() call produces a sequence of numbers starting from zero, with as many items as the students list. The length of the students list is 5, so the sequence generated is 0, 1, 2, 3, 4.
Hence there will be 5 bars positioned from 0 to 4.
The bar heights along the y-axis are taken from the list trophies.
STEP 4: Marking the title and label along the y-axis
plt.title("Students With Most Number of Trophies")
plt.ylabel("No. of Trophies")
STEP 5: Marking x-axis with the names of students
plt.xticks(range(len(students)), students)
The above code replaces the numeric labels under the bars with the names of the students. Without it, the bars would be labeled with numbers from 0 to 4.
STEP 6: Simply call plt.show()
plt.show()
Here is the complete program.
from matplotlib import pyplot as plt
students = ["Ardelia Connery", "Chester Clare", "Luciano Bartle", "Kacey Felker", "Tosha Tuel"]
trophies = [3, 2, 8, 7, 3]
plt.bar(range(len(students)), trophies)
plt.title("Students With Most Number of Trophies")
plt.ylabel("No. of Trophies")
plt.xticks(range(len(students)), students)
plt.show()
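As a variation on the program above, the names can be passed straight to plt.bar as the x values, which removes the need for the separate plt.xticks call. This is a sketch that assumes a matplotlib version that accepts strings as categorical x values; it is an alternative, not the approach this tutorial follows. The plt.show() call is left as a comment so the snippet can also run where no window can be opened.

```python
from matplotlib import pyplot as plt

students = ["Ardelia Connery", "Chester Clare", "Luciano Bartle", "Kacey Felker", "Tosha Tuel"]
trophies = [3, 2, 8, 7, 3]

# Passing the names as x lets matplotlib label each bar with its name.
bars = plt.bar(students, trophies)
plt.title("Students With Most Number of Trophies")
plt.ylabel("No. of Trophies")
# plt.show()  # uncomment to display the figure

# The returned container holds one rectangle per bar.
heights = [bar.get_height() for bar in bars]
print(heights)
```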
Creating a histogram using python (matplotlib)
Finally, it is time to discuss how to plot a histogram. We already showed a problem statement and its histogram in the beginning. Now we will show how to make it using python & matplotlib. Let’s quote the statement from above.
Let’s suppose we have 10 students who have recently taken a test. The students have scored a range of marks. We want to understand the distribution of marks of these students so we can have a better understanding of how much the students are scoring. The marks are as follows:
marks = [6, 95, 82, 82, 73, 8, 75, 82, 99, 67]
Again we will be following a step by step explanation. One can find the complete program at the end of it.
STEP 1: Importing the libraries
from collections import Counter
from matplotlib import pyplot as plt
STEP 2: Creating the list of marks
marks = [6, 95, 82, 82, 73, 8, 75, 82, 99, 67]
STEP 3: Creating the histogram dictionary
histogram = Counter(min(mark // 10 * 10, 90) for mark in marks)
As we already showed above, Counter() will go through a sequence and make a dictionary with each item as the key and its count as the value.
The for clause accesses each item in the marks list and stores it in the variable mark. The // operator represents floor division. That means the value in the mark variable is divided by 10 and rounded down to the nearest whole number. Then it is multiplied by 10 again.
For example, if the element is 99, floor division by 10 gives 9 (9.9 rounded down to 9). Multiplying by 10 then gives 90. Hence every mark is rounded down to the nearest multiple of 10, which puts all the marks of one interval into the same class.
The min() function returns the lower of the two values. We can see one of the values is fixed at 90. The min() exists only in case a student happens to score 100. After floor division and multiplication, the value would still be 100. The min() puts that mark along with the rest of the marks in the 90s range.
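To see what the expression inside Counter() does to individual marks, we can wrap it in a small function and try a few values. The function name bucket is hypothetical and only used for this illustration; the expression itself is the one from STEP 3.

```python
def bucket(mark):
    """Return the start of the 10-mark interval that a mark falls in."""
    return min(mark // 10 * 10, 90)


for mark in [6, 67, 82, 99, 100]:
    print(mark, "->", bucket(mark))
```

A mark of 100 lands in the 90 class instead of opening a class of its own, which is exactly what the min() is there for.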
STEP 4: Plotting the bars
plt.bar([x + 5 for x in histogram.keys()], histogram.values(), 10, edgecolor=(0, 0, 0))
The first parameter represents the values on the x-axis. histogram.keys() gives us the keys from the histogram dictionary. We add 5 to each key because plt.bar centers each bar on its x value by default; with a bar width of 10, the bar for the key 80 then spans 80 to 90.
The second parameter represents the values on the y-axis. histogram.values() gives us the counts.
10 is the width of the bars. Hence our bars have a width of 10 units.
The edge color is black, represented by the RGB tuple (0, 0, 0). This creates a visual distinction between the bars.
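The effect of the 5-unit shift can be verified with simple arithmetic, without drawing anything. The sketch below assumes the default behaviour of plt.bar, where each bar is centered on its x value, and computes where every bar starts and ends.

```python
from collections import Counter

marks = [6, 95, 82, 82, 73, 8, 75, 82, 99, 67]
histogram = Counter(min(mark // 10 * 10, 90) for mark in marks)

bar_width = 10
edges = {}
for bin_start in sorted(histogram):
    center = bin_start + 5            # x value passed to plt.bar
    left = center - bar_width / 2     # where the bar begins
    right = center + bar_width / 2    # where the bar ends
    edges[bin_start] = (left, right)
    print(f"bin {bin_start}: bar spans {left:g} to {right:g}")
```

Each bar covers exactly its own interval, so the bar for the 80s begins at 80 and ends at 90. The align="edge" option of plt.bar should give the same picture without the shift, since it places the left edge of each bar at its x value.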
STEP 5: Defining the axis
plt.axis([-5, 105, 0, 5])
The x-axis will go from -5 to 105. This leaves 5 units of space at the beginning and end of the x-axis. Because each bar is centered 5 units to the right of its key, the first bar starts at 0 and the last one ends at 100.
The y-axis will go from 0 to 5.
STEP 6: Marking the x-axis
plt.xticks([10 * i for i in range(11)])
The x-axis will be marked from 0 to 100.
STEP 7: Labeling
plt.xlabel("Mark Range")
plt.ylabel("Number of Students")
plt.title("Distribution of Test Marks")
STEP 8: Calling plt.show()
plt.show()
Here is the full code.
from collections import Counter
from matplotlib import pyplot as plt
marks = [6, 95, 82, 82, 73, 8, 75, 82, 99, 67]
histogram = Counter(min(mark // 10 * 10, 90) for mark in marks)
plt.bar([x+5 for x in histogram.keys()], histogram.values(), 10, edgecolor=(0, 0, 0))
plt.axis([-5, 105, 0, 5])
plt.xticks([10 * i for i in range(11)])
plt.xlabel("Mark Range")
plt.ylabel("Number of Students")
plt.title("Distribution of Test Marks")
plt.show()
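For comparison, matplotlib also ships a ready-made plt.hist function that does the counting and the plotting in one call. The sketch below is an alternative to the dictionary-based program above, not a replacement for understanding it. The bin edges are our own choice, and the plt.show() call is left as a comment so the snippet can run where no window can be opened.

```python
from matplotlib import pyplot as plt

marks = [6, 95, 82, 82, 73, 8, 75, 82, 99, 67]

# Bin edges 0, 10, ..., 100. The last bin includes its right edge,
# so a mark of 100 would be counted in the 90-100 bin.
bin_edges = list(range(0, 101, 10))

# plt.hist returns the counts, the bin edges, and the drawn bars.
counts, edges, patches = plt.hist(marks, bins=bin_edges, edgecolor="black")

plt.xticks(bin_edges)
plt.xlabel("Mark Range")
plt.ylabel("Number of Students")
plt.title("Distribution of Test Marks")
# plt.show()  # uncomment to display the figure
print(counts)
```

The counts returned by plt.hist should match the values stored in the histogram dictionary, with zeros for the empty intervals in between.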
Disadvantages of a histogram
Now that we have completed our tutorial, let’s conclude it by discussing some disadvantages that come with histograms.
- The obvious one is that we can only use continuous data. Any break in our data set and we will have to depend on other methods.
- A single histogram shows the distribution of only one variable, so two different types of data cannot be compared in it.
- It is hard to extract the original input from the histogram as data gets placed in intervals and the original data point is lost.
- Histograms can be corrupt if there is a time difference in our data set. For example, our data collected using modern technology might be far more accurate than something collected 40 years ago. We can still plot a histogram all through the 40-years but the accuracy varies throughout hence corrupting our histogram.
- Given a large and complex set of data, predicting the most accurate intervals will prove to be a difficult task.
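The point about losing the original data can be demonstrated with two made-up sets of marks that differ in every value yet produce the same counts. bin_counts is a hypothetical helper written for this illustration.

```python
from collections import Counter


def bin_counts(data, width=10):
    """Count how many values fall in each interval of the given width."""
    counts = Counter(value // width * width for value in data)
    return dict(sorted(counts.items()))


# Two hypothetical data sets with no value in common.
marks_a = [61, 62, 63, 88, 89]
marks_b = [69, 68, 67, 80, 81]

print(bin_counts(marks_a))
print(bin_counts(marks_a) == bin_counts(marks_b))  # True
```

Both sets give three marks in the 60s and two in the 80s, so their histograms are identical and the individual marks cannot be recovered from the plot.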