Sam cleaning the Citadel (GoT) for 10 hours with Python
As a fan of Game of Thrones, I couldn’t wait for it to return for a 7th season. Watching the season premiere, I greatly enjoyed that iconic scene of Sam doing his chores at the Citadel.
I enjoyed it so much that I wanted to see more of it… much more of it.
In this post we’ll take the short video compilation of Sam cleaning the Citadel, split it into multiple sub clips, and create a new video of Sam cleaning the Citadel using a random mix of those sub clips.
Now, it would have taken me about five minutes to do all of this manually by splitting the video and then combining all the parts using some video editing program. But why would I waste five minutes when I can spend a whole weekend doing the same thing with Python?
I mean… am I right or am I right? Right?
Anyway…
We will use Python to analyze the audio from the video, find the “silent” parts in the audio, and use those to split the video into sub clips. Then, we will combine those sub clips in a random order and save it as a new video file.
Disclaimer: As this is one of the first times I’ve actually done this kind of signal processing and analysis, I’m pretty sure there are better and more optimized ways to do this kind of thing. If you have different suggestions, please share them in the comments.
Source code can be found on GitHub.
Create a dev environment for our project
For this project, I used two main packages: librosa and moviepy.
Start by creating a virtual / conda environment with the following packages:
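For example, a minimal setup could look like this (whether you use pip or conda, and which versions you pin, is up to you):

```bash
pip install librosa moviepy matplotlib
```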
We will use the librosa package to analyze the audio from the video and find silent frames in it. After that we will use those frames to split the video and combine them randomly using the moviepy package.
You may also need to install ffmpeg on your computer for processing the video files, since it’s needed for MoviePy to run.
On OSX, you can install ffmpeg by running the following command line (using Homebrew, for example):
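```bash
brew install ffmpeg
```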
Processing the audio frames from the video
First we’ll load the audio from our video file:
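A minimal sketch of this step, assuming the source clip is saved locally as citadel.mp4 (a placeholder name) and that ffmpeg is available so librosa can decode the audio track straight from the video file:

```python
import librosa

# "citadel.mp4" is a placeholder -- point this at your own video file.
# librosa falls back to an ffmpeg-backed reader, so it can decode the
# audio track directly from the video. The default sampling rate is 22,050 Hz.
audio_data, sr = librosa.load("citadel.mp4")
```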
librosa.load will load the audio data from the file as a numpy array into audio_data and will set the sampling rate of the audio data in sr, which is the number of samples per second in the audio. In this case, the sample rate was 22,050 Hz, which means 22,050 samples per second.
Let’s take a look at the audio.
In order to do that we can use librosa.display to plot the wave. The code for that will use the librosa and matplotlib packages:
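Something along these lines should do it (librosa.display.waveplot is the call from the librosa versions of that time; newer releases use librosa.display.waveshow instead):

```python
import librosa.display
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 5))
librosa.display.waveplot(audio_data, sr=sr)  # waveshow() in librosa >= 0.10
plt.show()
```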
If we run the above code on our video file we will get the following wave plot:
Seems like the audio for this scene is a bit noisy. We want to separate the video into sub clips, and will assume that there is a brief period of silence between them. In order to find the boundaries for each sub clip, we first need to define what “silence” looks like in our wave plot.
Let’s zoom into one of the parts in the audio and try to identify the “boundaries of silence”:
We get a better and clearer image of what the audio looks like in the “silent” parts:
It seems like we can declare that a location in the audio is silent if the wave is between -0.02 and 0.02.
We can add these limits to the wave plot to see how much of it is “silent”:
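One way to draw those limits on top of the wave, continuing the previous snippet (the 0.02 threshold comes from the plot above, and the constant name is mine):

```python
SILENCE_LIMIT = 0.02  # amplitude below which we consider the audio "silent"

plt.figure(figsize=(14, 5))
librosa.display.waveplot(audio_data, sr=sr)
plt.axhline(y=SILENCE_LIMIT, color="r")
plt.axhline(y=-SILENCE_LIMIT, color="r")
plt.show()
```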
We can zoom into one of the parts in the audio and check our “boundaries of silence”:
We can see that the boundaries we created contain the “silent” part of the audio. Seems like we might have been able to set a lower limit:
Once we set our limits, we can analyze the audio and try to find “silences”.
We don’t want to do this for each point since we have 22,050 of them per second, so we will define a frame as a small chunk of the audio file. We do that by setting a frame size in seconds and then iterating over the audio data, frame by frame, and checking for each frame if it has a value that exceeds our limits. If all values in the frame are between the limits, we can say the frame is “silent”.
Labeling “silent” frames:
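Here is a sketch of how that labeling could look, reusing audio_data and sr from before (the constant names are mine):

```python
import numpy as np

FRAME_SIZE_SECONDS = 0.1   # length of a frame in seconds
SILENCE_LIMIT = 0.02       # same amplitude threshold as above

frame_size = int(FRAME_SIZE_SECONDS * sr)  # samples per frame (2,205 at 22,050 Hz)
silent_frames = []

# Walk over the audio frame by frame and keep the index of every frame
# whose samples all stay inside the silence limits.
for frame_index, start in enumerate(range(0, len(audio_data), frame_size)):
    frame = audio_data[start:start + frame_size]
    if np.max(np.abs(frame)) < SILENCE_LIMIT:
        silent_frames.append(frame_index)
```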
What we’ve done here is find the indexes of “silent” frames in the audio.
In this case the frame size is set to 0.1s, meaning that all data points in a 0.1s chunk of audio are considered a frame. This means that each frame contains 0.1 * sample rate (22,050), which is 2,205 data samples. This frame size worked well for this program, since we expect the sub clips to be longer than that. You can play around with this size and see how it changes the result.
For each frame, if the maximum absolute value of the frame is below our limit (0.02) we mark it as a “silent” frame and collect its index in our silent_frames list.
Once we’ve collected all the indexes of the “silent” frames, we can plot them on our audio wave:
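For example, by drawing a vertical line at the start time of every frame we marked as “silent”:

```python
plt.figure(figsize=(14, 5))
librosa.display.waveplot(audio_data, sr=sr)
# Each silent frame starts at frame_index * FRAME_SIZE_SECONDS seconds.
for frame_index in silent_frames:
    plt.axvline(x=frame_index * FRAME_SIZE_SECONDS, color="r", linewidth=0.5)
plt.show()
```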
Looks like we marked a lot of sections as “silent”! We can aggregate consecutive “silent” frames, and plot the start of each “silent” sequence:
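A possible implementation of that aggregation, continuing with silent_frames, frame_size and audio_data from above. FRAME_MIN_SIZE comes from the post; the helper function and the silence_starts list are names I made up for this sketch:

```python
FRAME_MIN_SIZE = 3  # a "silence" must be more than 3 consecutive silent frames


def quietest_sample(first_frame, last_frame):
    """Return the sample index with the lowest absolute amplitude
    inside a run of silent frames (its "most silent" point)."""
    start = first_frame * frame_size
    end = (last_frame + 1) * frame_size
    return start + int(np.argmin(np.abs(audio_data[start:end])))


silence_starts = []              # quietest sample index of every long-enough silence
current_run = [silent_frames[0]]

for frame_index in silent_frames[1:]:
    if frame_index == current_run[-1] + 1:
        current_run.append(frame_index)      # still inside the same silent sequence
        continue
    if len(current_run) > FRAME_MIN_SIZE:    # keep only silences that are long enough
        silence_starts.append(quietest_sample(current_run[0], current_run[-1]))
    current_run = [frame_index]

if len(current_run) > FRAME_MIN_SIZE:        # don't forget the last run
    silence_starts.append(quietest_sample(current_run[0], current_run[-1]))
```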
In this code section, we aggregated the sections of silent frames in order to create longer frames. These frames will be used later for splitting our video into sub clips, then mixing them together for our new Sam in the Citadel video mix. As we want to make sure that our frames are long enough, we also check that our aggregated frames are more than 3 frames long. This is set using the FRAME_MIN_SIZE value.
We also added functionality that searches for the lowest local minimum in our audio. We need this feature so that when we cut our audio/video into sub clips, each sub clip ends at the most silent point we can find in our “silent” frames.
Let’s take another look at the cleaner plot:
We can now use the most silent point of each “silent” sequence as the beginning of each sub clip in the video. We will split the video accordingly, meaning we need to save a list of the start of each “silent” sequence and the end of the desired frame. For finding the end of the frame we will use the sample just before the start of the next “silent” sequence.
We will create a list of audio frames, each with a start and end index:
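Continuing with the silence_starts list from the previous sketch, each entry becomes the start of one sub clip and the next entry (minus one sample) becomes its end:

```python
audio_frames = []  # (start_sample, end_sample) for every sub clip

for start, next_start in zip(silence_starts, silence_starts[1:]):
    audio_frames.append((start, next_start - 1))

# The last sub clip runs from the final silence to the end of the audio.
audio_frames.append((silence_starts[-1], len(audio_data) - 1))
```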
That’s it, we have everything we need in order to create our new (longer and better!) video of Sam’s Citadel chores.
Yes, Sam, it will be magnificent.
Creating the video mix
Now that we have a list of each audio frame’s start and end indexes, we can use it for creating our new video.
In this part we will convert the audio indexes to video timestamps, split the video using those timestamps, and combine the resulting sub clips randomly into a new video.
First, let’s load the movie using the moviepy module:
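Assuming the same placeholder file name as before and a MoviePy 1.x install (in MoviePy 2.x the import is simply from moviepy import VideoFileClip):

```python
from moviepy.editor import VideoFileClip

video = VideoFileClip("citadel.mp4")  # placeholder file name
```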
Next, we’ll convert each audio frame’s start and end index to a video timestamp (MoviePy requires timestamps in seconds). We can convert sample indexes to timestamps by dividing the index by the sample rate. For example, the timestamp for the 1000th sample will be 1000 / 22050, ~0.0453.
We will then use these timestamps to cut the video and save each sub clip:
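A sketch of that step, reusing audio_frames and sr from above (note that subclip() was renamed to subclipped() in MoviePy 2.x):

```python
sub_clips = []

for start_index, end_index in audio_frames:
    start_time = start_index / sr   # sample index -> seconds
    end_time = end_index / sr
    sub_clips.append(video.subclip(start_time, end_time))
    # Optionally write each sub clip to its own file, e.g.:
    # sub_clips[-1].write_videofile(f"sub_clip_{start_index}.mp4")
```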
Let’s take a look at some example sub clips:
The next step is to mix the sub clips that we created into a new video. We can create a new video that is longer than the original video by combining the same sub clips again and again in a random order.
We will create a new video clip that’s 5 minutes long (you can create longer clips but for this example, let’s keep it short):
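One way to do the mix, using the sub_clips list from above (the target duration and output file name are placeholders):

```python
import random
from moviepy.editor import concatenate_videoclips

TARGET_DURATION = 5 * 60  # seconds

mixed_clips = []
total_duration = 0
while total_duration < TARGET_DURATION:
    clip = random.choice(sub_clips)   # pick the next sub clip at random
    mixed_clips.append(clip)
    total_duration += clip.duration

final_video = concatenate_videoclips(mixed_clips)
final_video.write_videofile("sam_citadel_mix.mp4")
```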
The result will look something like:
That’s it, we now have a longer version of Sam working in the Citadel!
Source code can be found on GitHub.
What’s next?
You’re probably wondering: what else can we do with this?
Well, I have only one thing to say about that:
10 hours of OYSTER, CLAMS and COCKLES!