2020 was a crappy year, no doubt. There may have been more bad than good for many people. Still, we should remember that there are good things we can and should appreciate, even during hard times. So I wanted my reflection to be an appreciation of something. This brought me back to this post on /r/dataisbeautiful where the author analyzed their relationship through text messages. Since I know a little bit of Python, I thought, why not give it a try too? It could be a good learning opportunity for me as well as a way to appreciate this relationship.
My SO and I talk mostly on Telegram and only occasionally on Messenger. For this analysis, I went ahead with only the Telegram data, as it should cover the majority of our conversations. If you're using Telegram Desktop, you can export your chat data as either JSON or HTML. I chose the JSON export option and excluded all photos to reduce the download time. The exported result.json contains all of my text messages, but I only wanted the chat with my SO, so I extracted only our texts.
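A minimal sketch of this extraction step might look like the following. It assumes a full export in which result.json nests each conversation under chats → list, with the messages in a "messages" array; that layout, the chat name "My SO", and the helper name extract_chat_messages are assumptions for illustration, so check them against your own export.

```python
import json

def extract_chat_messages(export: dict, chat_name: str) -> list:
    """Return the messages of the first chat named chat_name, or [] if none."""
    # Assumed layout: {"chats": {"list": [{"name": ..., "messages": [...]}]}}
    for chat in export.get("chats", {}).get("list", []):
        if chat.get("name") == chat_name:
            return chat.get("messages", [])
    return []

# Tiny stand-in for the real file. With an actual export you would use:
#   with open("result.json", encoding="utf-8") as f:
#       export = json.load(f)
SAMPLE_JSON = """
{"chats": {"list": [
  {"name": "My SO", "type": "personal_chat", "messages": [
    {"id": 1, "type": "message", "date": "2020-01-03T21:15:00",
     "from_id": "user1", "text": "hi"},
    {"id": 2, "type": "message", "date": "2020-01-03T21:16:00",
     "from_id": "user2", "text": "hello"}
  ]},
  {"name": "Work group", "type": "private_group", "messages": []}
]}}
"""
export = json.loads(SAMPLE_JSON)

# msg_data is the list of message dicts for the one chat we care about
msg_data = extract_chat_messages(export, "My SO")
```

The resulting msg_data list is the kind of input that can be handed straight to pd.DataFrame.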
Next is to convert the data to a pandas data frame. Since the analysis only covers 2020, I made a new column for the date and time and filtered by year.
    import pandas as pd

    df = pd.DataFrame(msg_data)
    # Parse the timestamp strings, then keep only the messages from 2020
    df['date_converted'] = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S')
    year_mask = df['date_converted'].dt.year == 2020
    df = df[year_mask]
Telegram stores the text field as a mixed object type that covers both plain string messages and photo messages, but for this purpose I only need the string-based text, so I added a new column that holds the text as strings:
df['str_text_only'] = df['text'].astype('string')
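If some values in the text field turn out to be lists rather than plain strings, a small helper can flatten them before any word-level analysis. The sketch below assumes that such a list mixes plain strings with dicts carrying a "text" key; both that shape and the name flatten_text are assumptions for illustration, not something confirmed by the export used here.

```python
def flatten_text(text):
    """Collapse a Telegram-style `text` value into one plain string."""
    if isinstance(text, str):
        return text
    if not isinstance(text, list):
        # Missing or unexpected values (None, NaN, ...) become an empty string
        return ""
    parts = []
    for piece in text:
        if isinstance(piece, str):
            parts.append(piece)
        elif isinstance(piece, dict):
            # Assumed shape: {"type": "link", "text": "example.com"}
            parts.append(piece.get("text", ""))
    return "".join(parts)

plain = flatten_text("good night")
mixed = flatten_text(["see ", {"type": "link", "text": "example.com"}, " later"])
```

With a data frame, this could be applied as df['text'].map(flatten_text) to build the string-only column.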
With the data ready, we can start analyzing the texts.
Who sends more?
My first question was which one of us sends more texts. This can easily be calculated by filtering on the from_id column. Out of a total of 67K messages, I sent 29K and my SO sent 38K. If we plot this per month (I used matplotlib, so pardon the default-looking graph, I'm not familiar with the library), we can see that my SO sent a lot more texts than me in most months of 2020.
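One way to get these per-sender counts is a pandas groupby, sketched below on a tiny made-up data frame. The from_id values "user1" and "user2" and the sample timestamps are placeholders; only the column names from_id and date_converted come from the steps above.

```python
import pandas as pd

# Made-up sample standing in for the real 2020 message data
df = pd.DataFrame({
    "from_id": ["user1", "user2", "user2", "user1", "user2"],
    "date_converted": pd.to_datetime([
        "2020-01-03T21:15:00", "2020-01-03T21:16:00", "2020-01-20T08:00:00",
        "2020-02-01T22:30:00", "2020-02-01T22:31:00",
    ]),
})

# Total number of messages per sender
totals = df["from_id"].value_counts()

# Rows are months (1-12), columns are senders, cells are message counts
monthly = (
    df.groupby([df["date_converted"].dt.month, "from_id"])
      .size()
      .unstack(fill_value=0)
)
```

Calling monthly.plot(kind="bar") on the result draws one group of bars per month with one bar per sender, which is one way to produce a chart like the one described.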
I showed my SO the result and we were both surprised to find out that she sends a lot more than I do. Then we came up with a theory: the difference may be because of a habit we have. When we call late at night, I talk over the phone and she replies by text, because either the rest of her family is sleeping or she doesn't want to speak while her parents are around.
When are we most active?
Of course we talk day and night, but there should be a peak time when both of us are actively talking with each other. So what could ours be? Again, I plotted this on a bar chart to see when we are most active:
This could be one piece of evidence for my theory that the large difference comes from our late-night phone call habit: you can see on the graph that my SO sends a lot more messages late at night, on average, than I do. Our peak conversation time seems to be around 9 PM to 11 PM. During the daytime, we seem to consistently give time to this relationship every hour, which is a really wonderful thing.
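The hourly view can be sketched by bucketing messages by hour of day, again on made-up sample data. This version counts raw messages per hour rather than a per-day average, and the sender IDs are placeholders.

```python
import pandas as pd

# Made-up sample: three messages around 9 PM, one in the morning
df = pd.DataFrame({
    "from_id": ["user1", "user2", "user2", "user1"],
    "date_converted": pd.to_datetime([
        "2020-05-01T21:05:00", "2020-05-01T21:06:00",
        "2020-05-02T21:40:00", "2020-05-03T09:10:00",
    ]),
})

# Rows are hours 0-23, columns are senders; hours with no messages are filled with 0
hourly = (
    df.groupby([df["date_converted"].dt.hour, "from_id"])
      .size()
      .unstack(fill_value=0)
      .reindex(range(24), fill_value=0)
)

# Hour of day with the most messages from both senders combined
peak_hour = hourly.sum(axis=1).idxmax()
```

Reindexing to all 24 hours keeps quiet hours on the chart as empty bars instead of dropping them from the x axis.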
How long do we call?
Another question I had was how long we called each other in 2020. For this, I decided to use a scatter plot, with the hour on the x axis and the duration on the y axis. This part of the code is probably the cleanest I wrote, so I will show it to you guys here 🤣. For this code, I did some Googling because I was stuck on how scatter plots work. Thankfully, this data analysis helped me out, and I copied the hour conversion logic from there.
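The hour conversion itself is not reproduced in this post, so the following is only a guess at what such a step could look like: it derives an hour column from the parsed timestamp with the pandas .dt accessor. The hour_fraction column is an extra assumption, added in case a finer-grained x axis is wanted.

```python
import pandas as pd

# Made-up sample timestamps in the same format as the export
df = pd.DataFrame({"date": ["2020-03-01T21:45:10", "2020-03-02T09:05:00"]})
df["date_converted"] = pd.to_datetime(df["date"], format="%Y-%m-%dT%H:%M:%S")

# Whole hour of the day (0-23), suitable as the scatter plot's x value
df["hour"] = df["date_converted"].dt.hour

# Optional: hour with minutes as a fraction, e.g. 21:45 becomes 21.75
df["hour_fraction"] = df["hour"] + df["date_converted"].dt.minute / 60
```

Adding the column to df before filtering means any frame derived from it later carries the hour along.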
When I generated the plot, I found that it had a lot of 0-second durations in it. I thought something was not right, so I took a look at the Telegram data and saw that sometimes the call either didn't go through or got cancelled due to Internet problems. Telegram records this in the discard_reason field. I had to filter these out to get the actual phone call data, so I set up a filter that keeps a call only if one of us explicitly hung up and the duration is more than 60 seconds (because sometimes we pick up the call just to say we're busy and will call back):
    phone_mask = df['action'] == "phone_call"
    more_than_zero_mask = df['duration_seconds'] > 60
    hangup_mask = df['discard_reason'] == 'hangup'
    phone_call_df = df[(phone_mask) & (more_than_zero_mask) & (hangup_mask)]
I also added a column that converts seconds to minutes for ease of viewing:
    # Add the column to the filtered frame, since that is the one being plotted
    phone_call_df = phone_call_df.assign(duration_minutes=phone_call_df['duration_seconds'] / 60)
And if we plot this on a scatter plot:
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # One semi-transparent dot per call, so overlapping calls show up darker
    ax.scatter(phone_call_df['hour'], phone_call_df['duration_minutes'], alpha=0.5)
    ax.set(xlabel='Hour of day', ylabel='Phone Call duration (minutes)')
    ax.set_title('Phone Calls throughout 2020')
    ax.grid(True)
    plt.show()
We get the following:
From the plot, you can see that we call each other most between 9 PM and 11 PM, and most of our phone calls do not exceed 20 minutes. Sharing this with my SO also reminded us of the one time we talked for over two and a half hours. I'm happy that my effort in this analysis brought back some of our memories along the way.
I was thinking of doing a sentiment analysis as well, to graph our overall mood during our conversations. The problem, however, is that we talk mostly in Burglish (a transliteration of Burmese), and sadly I don't have a sentiment dataset for Burglish. On top of that, since I don't know of any Burglish stopword list, I couldn't analyze our top words either. If anyone has a tip on this, please feel free to email me. I promise to treat you to a coffee when we meet.
Well, that wraps up my 2020 romance reflection. I also made a word cloud, but I haven't received permission from my SO to share it, since some of the words are better kept private between us. Anyhow, that's it for this blog post. Happy New Year to everyone.