In statistical analysis, using weights to increase or decrease the relative importance of an item in a population is common. In real life, this has much application, particularly when calculating a weighted average. In this post, we will explore the concept and idea behind weights and also how to implement them using a pandas dataframe object.
Weight & Weighting factor
A weight or statistical weight is the amount given to increase or decrease the relative importance of an item in a data set. When it is assigned to a particular data point, it is called the weighting factor of that data point.
A weighting factor is just a fractional number that will affect the probability of that particular item being chosen. The sum of all probabilities is 1. Hence, the sum of all the weights should equal 1.
Weighting function
A weighting function allows us to allocate more weight for some data points compared to the other. This is handy when we know one part of our data set might be more accurate than the other.
For example, if we were to make a sample of the population of our country, we might have an equal probability of choosing a male or a female. But we know for a fact that there is only 49% male while there is 51% female. So if we choose a random sample with an equal probability of selecting either a male or a female, our sample will not represent the real population. For that, we will give 49% weight to the male class and 51% weight to the female class. Now we will be creating a sample that is by rule representative of the original population.
If f(a)
is the function that makes the sample of the population and w(a)
is the weighting function, then f(a)
x w(a)
will return the weighted sample.
Creating weighted samples using pandas.DataFrame
It is a simple task to create a weighted sample in pandas. The inbuilt .sample()
accepts weights
as a parameter that will automatically assign weights to the corresponding item. here is a sample program.
import pandas as pd
weightage = [0.1, 0.1, 0.2, 0.2, 0.4]
students = {'Name':['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
'Age': [14, 14, 15, 17, 16],
'Favorite subject': ['Maths', 'General Science', 'Social Studies', 'English Literature', 'Computer Science'],
'Hobby': ['Writing', 'Gardening', 'Coin collection', 'Reading', 'Programming']}
df = pd.DataFrame(students)
weighted_sample = df.sample(weights=weightage)
print(weighted_sample)
Output:
Name Age Favorite subject Hobby
4 Emma 16 Computer Science Programming
By default, .sample()
will only output one row. We can see that 40% weightage was given to the to the last row and it was selected. Now will try increasing the number of rows.
import pandas as pd
weightage = [0.1, 0.1, 0.2, 0.2, 0.4]
students = {'Name':['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
'Age': [14, 14, 15, 17, 16],
'Favorite subject': ['Maths', 'General Science', 'Social Studies', 'English Literature', 'Computer Science'],
'Hobby': ['Writing', 'Gardening', 'Coin collection', 'Reading', 'Programming']}
df = pd.DataFrame(students)
weighted_sample = df.sample(n=3, weights=weightage)
print(weighted_sample)
Output:
Name Age Favorite subject Hobby
4 Emma 16 Computer Science Programming
2 Sullivan 15 Social Studies Coin collection
3 Peter 17 English Literature Reading
The above output is not a given. Since these three values had 40%, 20%, and 20% chance, they were more likely to be selected over the first two values. But it’s only a matter of probability. We can run the program multiple times and get opposing results.
Using column as weights
The above example might be useful for a short set of data where we can define weights of each item corresponding to their index. But normally We may be required to have weight over some particular feature of our data. In the below example, we use age to create our sample.
import pandas as pd
students = {'Name':['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
'Age': [14, 14, 15, 17, 16],
'Favorite subject': ['Maths', 'General Science', 'Social Studies', 'English Literature', 'Computer Science'],
'Hobby': ['Writing', 'Gardening', 'Coin collection', 'Reading', 'Programming']}
df = pd.DataFrame(students)
weighted_sample = df.sample(n=3, weights="Age")
print(weighted_sample)
In this case, students with higher age will be more likely to be sampled than the other.
Drawback
If there is no weight, all items have an equal probability of being selected. But if the weights are present, the probability is changed. Weighting is done by repeating the number of rows. Rows with more weight will be repeated more than the ones with less weight. That comes with a major drawback.
Rows can only be repeated a definite number of times. Probabilities are fractional. So, the weights we give are rounded to a proper integer that describes the number of times the row is repeated. This will create some error where probability can be greater or smaller than the given weight depending on how it is rounded.