Building a lyrics recommendation bot

Published under the Coding category.

Have you ever wondered “What {artist} lyric would be most appropriate in this context?” when you are typing out a message, where “{artist}” is your favourite singer or band? I sure have! In my case, the artist of choice was Taylor Swift. If I said something like “cost”, what Taylor Swift lyric would match? Herein lies a fun project.

I decided to create an IRC bot for use in a community I am in that, given any sequence of words, would return the most related song lyric. In this post, I want to talk about the three main stages of building this project, with reference to code, so that you can use the underlying technology to make your own bots (whether they are about song lyrics or something else entirely!).

Here are some queries and responses from the bot:

Query: dating
Bot: That's what people say, mmm-mmm I go on too many dates

Query: own your data
Bot: Taking mine, but it's been promised to another Oh, I can't

Query: personal website
Bot: Having adventures on your own You meet some woman on the internet and take her home

Data collection and processing

To build a bot, we need data. For this, I searched Kaggle for a dataset that contained as many lyrics as possible. The highest-quality one I found, as measured by the number of albums included and the structure of the dataset, covered nine of Taylor Swift's ten albums. Her most recent album, Midnights, was not included. A shame indeed, but not a blocker for this project. Kaggle has a vast range of text datasets; explore to find one that suits your project.

Next, I had to clean up the data. The dataset I was working with contained one .csv file for every album. Each CSV contained one row per lyric line, along with metadata about that line (its line number in the song and the title of the associated album). I merged all of these CSVs together and created a data structure with the following information:


{
	"lyric": "But I keep cruisin' Can't stop, won't stop movin'",
	"album_name": "1989",
	"song_title": "Shake it Off"
}

I decided to store the data in JSON because I knew that in the next step I would be working with multi-dimensional arrays.

I bundled the lyrics so that every entry in the JSON object contains two consecutive lines concatenated into one.
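The merging and bundling steps can be sketched as follows. This is a minimal sketch, not the exact code I used: the `albums/*.csv` path and the column names `lyric`, `album_name`, and `song_title` are assumptions, so adjust them to match your dataset.

```python
import csv
import glob
import json


def load_album_csvs(pattern):
    """Read every per-album CSV into one list of dicts.

    Assumed columns: lyric, album_name, song_title (your
    dataset's column names may differ).
    """
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            records.extend(csv.DictReader(f))
    return records


def bundle_pairs(records):
    """Concatenate consecutive lyric lines in pairs, keeping the
    metadata of the first line in each pair."""
    bundled = []
    for i in range(0, len(records) - 1, 2):
        first, second = records[i], records[i + 1]
        bundled.append(
            {
                "lyric": f"{first['lyric']} {second['lyric']}",
                "album_name": first["album_name"],
                "song_title": first["song_title"],
            }
        )
    return bundled


if __name__ == "__main__":
    data = bundle_pairs(load_album_csvs("albums/*.csv"))
    with open("lyrics.json", "w") as f:
        json.dump(data, f)
```

A more careful version would also pair lines only within the same song, but this is enough to show the shape of the transformation.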

Calculating embeddings

“James,” I hear you ask, “how do you evaluate which lyrics are most related to a message?” That is where embeddings come in. Embeddings are long lists of numbers that encode the semantics of a piece of text. They are calculated with language models trained on a vast array of text, and they have many applications in natural language processing and computer vision.

Importantly, embeddings can be compared. Consider these two lyrics:

  • “The highlight of my day I’m taking pictures in my mind”
  • “That’s what people say, mmm-mmm I go on too many dates”

Using embeddings, we can compare a prompt like “picture” to the above two lyrics. This calculation will let us see which lyric – or, rather, which embedding calculated from a lyric – is most similar to the prompt. In the above example, a prompt with the text “picture” is closer to the first lyric than to the second. The first lyric is about taking pictures; the second is about dating. But if our prompt were “date”, the second would be more relevant.
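To make the comparison concrete, here is a toy version using cosine similarity, hand-rolled in NumPy for illustration (the three-dimensional vectors below are made up; real embeddings have hundreds of dimensions, and later we will use a library function instead):

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same
    direction, 0.0 means unrelated, -1.0 means opposite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up toy "embeddings" for the two lyrics and the prompt
pictures_lyric = [0.9, 0.1, 0.0]
dates_lyric = [0.1, 0.8, 0.1]
prompt = [0.8, 0.2, 0.1]

print(cosine_sim(prompt, pictures_lyric))  # high: closely related
print(cosine_sim(prompt, dates_lyric))     # lower: less related
```

The prompt vector points in nearly the same direction as the first lyric's vector, so its cosine similarity is higher.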

How do you calculate embeddings? For that, we are going to use a Python library called Sentence Transformers. Sentence Transformers allows you to use a wide range of natural language processing models. For this guide, we will use an embedding model to calculate text embeddings: sentence-transformers/all-MiniLM-L6-v2.

Create a new Python file and add the following code:

import sentence_transformers
import json

model = sentence_transformers.SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2"
)

# Load the lyric records we created in the last section
with open("lyrics.json", "r") as f:
    lyrics = json.load(f)

for item in lyrics:
    # Calculate an embedding and convert it to a plain list so it
    # can be serialised to JSON
    embedding = model.encode(item["lyric"])
    item["embedding"] = embedding.tolist()

with open("lyrics2.json", "w") as f:
    json.dump(lyrics, f)

In this code, we load a language model. For each lyric in the data structure we defined in the last section, we calculate an embedding. This happens in the model.encode() function call. We convert the embedding to a list, then save it in our JSON structure. Once we have calculated all of the embeddings, we save the results to a file.

Note: If you are working with a large dataset of tens of thousands of lyrics, you may want to add logic to incrementally save embeddings. I am using one JSON file to store my data since my dataset is a few thousand lyrics which my computer can store in memory during the embedding calculation stage.
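As one possible approach to incremental saving, you could write each record as its own line of JSON (JSON Lines) as soon as its batch is encoded, so a crash loses at most one batch. The function name `encode_in_batches` and the batching scheme here are my own sketch; `model` is any SentenceTransformer-style object with an `.encode()` method that accepts a list of strings:

```python
import json


def encode_in_batches(model, items, out_path, batch_size=500):
    """Encode lyrics in batches and append each finished batch to a
    JSON Lines file, so progress survives a crash mid-run."""
    with open(out_path, "a") as f:
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            embeddings = model.encode([it["lyric"] for it in batch])
            for item, emb in zip(batch, embeddings):
                item["embedding"] = [float(x) for x in emb]
                # One JSON object per line: flushed to disk batch by batch
                f.write(json.dumps(item) + "\n")
```

On restart you could count the lines already written and skip that many items before resuming.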

Finding related lyrics

Now we have a new value in our JSON structure: embedding. We can compare these embeddings to a message to find the most related song lyric to a message. I decided to make this an IRC bot, but you can extract the logic for any use case (an API, a Discord bot, or whatever else you want to build).

I needed code that could connect to IRC and listen for messages. If a message started with a command phrase (in this case, !ts), the words after the phrase could be used to find related lyrics. I used the following code, in a new file:

import pydle
import sentence_transformers
from sklearn.metrics.pairwise import cosine_similarity
import json

model = sentence_transformers.SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2"
)

with open("lyrics2.json", "r") as f:
    data2 = json.load(f)


class Bot(pydle.Client):
    async def on_connect(self):
        await self.join("#channel")

    async def on_message(self, target, source, message):
        if source != self.nickname and message.startswith("!ts"):
            # Remove the "!ts" prefix, leaving only the query
            query = message[len("!ts"):].strip()
            vector = model.encode(query)

            max_sim = 0
            max_sim_idx = 0

            for idx, item in enumerate(data2):
                sim = cosine_similarity([vector], [item["embedding"]])[0][0]

                if sim > max_sim:
                    max_sim = sim
                    max_sim_idx = idx

            await self.message(target, data2[max_sim_idx]["lyric"])


client = Bot("swiftbot", realname="swiftbot")
client.run("irc.libera.chat", tls=True, tls_verify=False)

In this code, we:

  1. Load our embedding model;
  2. Create an IRC bot, and;
  3. Program our IRC bot to listen for messages that start with !ts.

When our IRC bot gets a message that starts with !ts, we calculate an embedding for the message. Then, we compare that embedding to every embedding in the JSON structure we defined earlier. We use cosine similarity, a common metric for comparing embeddings. We retrieve the lyric associated with the embedding that has the highest similarity to the user’s message and send it back to the chat.

For my dataset of ~8,000 lyrics, it takes a few seconds to compare every embedding to the one calculated for the message. There are more efficient ways to do this, such as using a vector database, but this was a short evening project and I didn’t want to add that logic. With that said, the time it takes to compute the similarities introduces a welcome delay before the bot responds. It usually takes 2-3 seconds, which feels more natural than an instant response (you could add an artificial delay, too, but here it was an accidental yet delightful result of the code I wrote).
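If you do want the comparison step to be faster without reaching for a vector database, one lightweight option is to stack the stored embeddings into a single NumPy matrix at startup and compute every cosine similarity in one vectorised pass. A minimal sketch (the function name `best_match` is my own):

```python
import numpy as np


def best_match(query_vector, embedding_matrix):
    """Return the row index of the stored embedding most similar to
    query_vector, computed in one vectorised pass rather than a loop."""
    m = np.asarray(embedding_matrix, dtype=float)  # one row per lyric
    q = np.asarray(query_vector, dtype=float)

    # Cosine similarity of the query against every row at once
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))

    return int(np.argmax(sims))
```

You would build the matrix once when the bot starts, then look up the winning lyric by the returned index in the message handler.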

If you use the IRC code above, you should set your own bot name, change irc.libera.chat to the IRC network to which you are connecting, and change #channel to the name of the channel in which you want the bot to run.

Conclusion

This was a delightful project to work on. I now have ready access to the most apt Taylor Swift lyric for any phrase! I do want to add Midnights lyrics to this bot, but that is a task for another day.

I will end with another two responses from the bot:

Query: nyc pizza
Bot: I'm New York City I still do it for you, babe

Query: all the cost
Bot: (Evermore) Can't not think of all the cost
