Introduction
We often deal with large amounts of data, so it is usually practical to sample the data before drawing inferences. One of the most widespread methods for creating samples is simple random sampling. Today, we will discuss ways to accomplish that in Python. We will create random samples from Python sequences and also from the pandas DataFrame object, which is handy for data science.
Objectives
- Learn how to sample data from built-in Python types such as list, tuple, string, and set.
- Learn how to sample data from Pandas DataFrame.
What is a random sample?
A random sample is just what it sounds like. We create a sample from our data set by selecting items at random, so that every item has the same chance of being picked. It is one of the easiest and most useful ways to sample a data set.
Why use a random sample?
Generating a random sample is easy in practice, and it also helps us avoid selection bias. When we sample a data set, we must make sure that the sample is representative of the whole data set. A simple random sample is one of the easiest ways to achieve that. It is also easy to produce another random sample from the same data set for comparison.
Creating random samples from list, tuple, string, or set
In Python, a list, a tuple, and a string are each treated as a sequence of data, which means we can use an index to access their values. A set can store multiple values, but it is unordered and its values cannot be repeated. We can generate a sample from all four data types using the random.sample() method, although sets need extra care in recent Python versions.
Creating random sample from a list of numbers
STEP 1: Importing the random library
import random
STEP 2: Creating a list
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
STEP 3: Generating a random number
random_number = random.sample(numbers, 1)
print(random_number)
Output:
[2]
We can change the value 1 inside random.sample() to our desired amount. If we put 2, it will output 2 random numbers from the list, as shown below.
import random
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random_number = random.sample(numbers, 2)
print(random_number)
Output:
[10, 8]
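One property worth knowing is that random.sample() draws without replacement, so the same position in the list is never picked twice. The sketch below illustrates this; the seed value 42 is an arbitrary assumption, used only so that repeated runs give the same sample.

```python
import random

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Seeding is optional; 42 is an arbitrary value chosen for this sketch.
random.seed(42)

# Sampling without replacement: each position is used at most once.
sample = random.sample(numbers, 5)
print(sample)

# Asking for the whole population returns a shuffled copy of it.
shuffled = random.sample(numbers, len(numbers))
print(sorted(shuffled) == numbers)  # True
```

Because the items in this list are all distinct, the sample never contains duplicates.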
We can replace the list with a tuple or a set, and the steps will be the same.
import random
# declaring list, tuple, and set
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
numbers_tuple = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
numbers_set = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
# selecting a random number
random_number = random.sample(numbers, 1)
random_number_tuple = random.sample(numbers_tuple, 1)
random_number_set = random.sample(numbers_set, 1)
# printing the output
print(random_number)
print(random_number_tuple)
print(random_number_set)
Output:
[3]
[8]
[7]
Please note: random.sample() accepts a set only in older Python versions. In Python 3.9 and 3.10, calling it on a set still works but emits a deprecation warning. From Python 3.11 onward, it raises a TypeError because the population must be a sequence, so the set should be converted to a list or tuple before sampling.
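A minimal sketch of a version-independent way to sample from a set is to turn it into a sequence first. The variable name population is introduced here only for illustration.

```python
import random

numbers_set = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

# Convert the set to a sequence first. sorted() returns a list in a
# predictable order, which also helps when a seed is used.
population = sorted(numbers_set)

random_number_set = random.sample(population, 1)
print(random_number_set)
```

Using list(numbers_set) or tuple(numbers_set) would work as well; sorted() is chosen here only because it gives the population a stable order.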
Creating random sample from a string
The step by step process is pretty much the same. The code will look as follows.
import random
name = "ABRAHAM LINCOLN"
random_letter = random.sample(name, 1)
print(random_letter)
Output:
['H']
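Note that the result is a list of single-character strings, not a string. If a string is needed, the characters can be joined back together, as in this small sketch (the names random_letters and joined are illustrative).

```python
import random

name = "ABRAHAM LINCOLN"

# random.sample() returns a list of single-character strings.
random_letters = random.sample(name, 3)

# Join the characters back into one string if a string is needed.
joined = "".join(random_letters)
print(random_letters)
print(joined)
```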
Handling exceptions
If we request a sample size that is greater than the length of the sequence (or a negative number), random.sample() raises a ValueError. For example:
import random
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random_number = random.sample(numbers, 11)
print(random_number)
ValueError: Sample larger than population or is negative
Remember, we create samples in order to understand a larger population, so it makes no sense to have a sample size greater than the population itself. This sounds like a silly mistake, but it can cause our program to crash. We can use try/except to handle it.
import random
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
try:
    random_number = random.sample(numbers, 11)
    print(random_number)
except ValueError:
    # Catch only the error raised for an invalid sample size
    print("The given sample size is bigger than the population or the number is negative")
Output:
The given sample size is bigger than the population or the number is negative
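An alternative to catching the error is to validate the sample size before calling random.sample(). The sketch below assumes that returning fewer items than requested is acceptable; safe_sample is a hypothetical helper written for this post, not part of the random module.

```python
import random


def safe_sample(population, k):
    """Return up to k random items (hypothetical helper for illustration)."""
    if k < 0:
        raise ValueError("Sample size cannot be negative")
    # Cap the request at the population size instead of failing.
    k = min(k, len(population))
    return random.sample(population, k)


numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
result = safe_sample(numbers, 11)
print(len(result))  # 10
```

Whether to cap the size silently or to report the problem depends on the program; capping hides the mistake, so the try/except approach may be preferable when the caller should be told.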
Creating random samples from Pandas Dataframe
If we're working on a data science project, chances are we will need to create a random sample from a pandas DataFrame. Just like Python, the pandas library makes this simple with a built-in sampling method.
STEP 1: Open the command prompt and install pandas
pip install pandas
STEP 2: Importing pandas
import pandas as pd
STEP 3: Creating a dictionary for building our dataframe
students = {'Name': ['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
            'Age': [14, 14, 15, 17, 16],
            'Favorite subject': ['Maths', 'General Science', 'Social Studies', 'English Literature', 'Computer Science'],
            'Hobby': ['Writing', 'Gardening', 'Coin collection', 'Reading', 'Programming']}
STEP 4: Converting to dataframe
df = pd.DataFrame(students)
print(df)
Output:
       Name  Age    Favorite subject            Hobby
0      Tony   14               Maths          Writing
1      Jake   14     General Science        Gardening
2  Sullivan   15      Social Studies  Coin collection
3     Peter   17  English Literature          Reading
4      Emma   16    Computer Science      Programming
Now our dataframe is ready. What we need to do next is pick a random sample, and it is surprisingly easy.
STEP 5: Creating a random sample
random_sample = df.sample()
print(random_sample)
Output:
       Name  Age  Favorite subject            Hobby
2  Sullivan   15    Social Studies  Coin collection
Simply calling .sample() picks a single random row and returns it. To get a fixed number of rows, pass the n parameter as follows.
random_sample = df.sample(n=3)
print(random_sample)
Output:
       Name  Age    Favorite subject            Hobby
3     Peter   17  English Literature          Reading
0      Tony   14               Maths          Writing
2  Sullivan   15      Social Studies  Coin collection
Here it outputs 3 rows. We can also express the sample size as a fraction. In the example below, we use 30% of our data as the sample. With 5 rows, 30% is 1.5 rows, which pandas rounds to 2.
random_sample = df.sample(frac=0.3)
print(random_sample)
Output:
    Name  Age    Favorite subject      Hobby
3  Peter   17  English Literature    Reading
1   Jake   14     General Science  Gardening
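Each call to .sample() normally returns different rows. When a repeatable sample is needed, for example to compare results between runs, the random_state parameter can be set. The sketch below rebuilds a smaller version of the dataframe; the seed value 1 is an arbitrary assumption.

```python
import pandas as pd

students = {'Name': ['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
            'Age': [14, 14, 15, 17, 16]}
df = pd.DataFrame(students)

# The same random_state gives the same rows on every call.
first = df.sample(n=3, random_state=1)
second = df.sample(n=3, random_state=1)

print(first)
print(first.equals(second))  # True
```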
PS: If we're dealing with a .csv file, we can use pd.read_csv(), which loads it directly into a DataFrame object. Then we can apply the same steps as above.
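As a rough sketch of that workflow, the example below reads CSV data and samples two rows. To keep it self-contained, it uses an in-memory string (csv_text, made up for this illustration) in place of a real file; with an actual file, its path would be passed to pd.read_csv() instead.

```python
import io

import pandas as pd

# In-memory stand-in for a .csv file, used only for this illustration.
csv_text = """Name,Age
Tony,14
Jake,14
Sullivan,15
Peter,17
Emma,16
"""

# pd.read_csv() accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

random_sample = df.sample(n=2)
print(random_sample)
```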
Conclusion
In this post, we discussed what simple random sampling is and how it is useful in the data science world. We also discussed how to create a simple random sample from sequences of data and from the pandas DataFrame object, and we learned how to handle the exception raised by an invalid sample size.