Analyzing my Facebook Messages

CIS 545 Final Project

By Grace Jiang | View Source Code Here

Introduction

I've been using Facebook as my primary means of communicating with my friends and family since 2012. I decided to analyze my messaging habits and history over the past 8 years using Pandas.

This project interests me because I want to see how frequently I talked to different friends during different periods of my life, along with metrics such as the average "duration" of my close friendships, how the language I use has changed over time, and my "happiness" trends based on NLP analysis. My ultimate goal is to learn more about my messaging habits over the years.

This project is open source, so anyone who downloads their own Facebook Messenger data and runs my code should also be able to look at their own messaging trends over the years!

 

1. Data Acquisition & Cleaning

Facebook allows its users to download their user data from the settings tab. The data comes as one large folder that contains one folder per conversation, and each of those folders contains one or more JSON files that store the conversation history.

I loaded all the data into one big dataframe by looping through the root directory and reading in any files that matched the "*.json" pattern.
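As a rough sketch of what this loading step could look like (not the exact project code): the function name `load_messages` and the JSON keys `messages`, `sender_name`, `timestamp_ms`, and `content` are assumptions about the export layout, so adjust them to match your own download.

```python
import json
from pathlib import Path

import pandas as pd


def load_messages(root):
    """Read every *.json file under `root` into one DataFrame of messages."""
    rows = []
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Assumed layout: {"messages": [{"sender_name": ..., "timestamp_ms": ..., "content": ...}]}
        for msg in data.get("messages", []):
            rows.append({
                "conversation": path.parent.name,  # one folder per conversation
                "sender_name": msg.get("sender_name"),
                "timestamp_ms": msg.get("timestamp_ms"),
                "content": msg.get("content"),  # missing for non-text messages
            })
    return pd.DataFrame(rows, columns=["conversation", "sender_name", "timestamp_ms", "content"])
```

Calling `load_messages("path/to/export")` on the unzipped export folder would then return one row per message, tagged with the folder name of the conversation it came from.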

 

Afterwards, I did some basic data cleaning such as dropping NaN values and type-conversion in my date columns.
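A minimal sketch of this kind of cleaning, assuming the dataframe has `sender_name`, `timestamp_ms` (milliseconds since the epoch), and `content` columns. These column names and the helper `clean_messages` are illustrative, not taken from the project source.

```python
import pandas as pd


def clean_messages(df):
    """Drop rows with missing fields and convert the timestamp to a datetime."""
    cleaned = df.dropna(subset=["sender_name", "timestamp_ms", "content"]).copy()
    # Assumed: timestamps are stored as integer milliseconds since the epoch
    cleaned["timestamp"] = pd.to_datetime(cleaned["timestamp_ms"], unit="ms")
    cleaned["year"] = cleaned["timestamp"].dt.year
    return cleaned.reset_index(drop=True)
```

Adding a `year` column here makes the later per-year grouping a one-liner.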

 

Finally, after loading all my data into one big fat dataframe, I was ready to analyze my messaging history!

2. Basic Metric Analysis

I started off by measuring some different metrics of my messaging history, such as the number of messages I've sent and received, as well as how many total conversations I've had with different people.

Total Number of Messages in All Conversations: 2,732,052

Total Number of Messages Sent: 1,287,000

Total Number of Messages Received: 1,445,052

Number of Different Conversations: 2,557
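Counts like the ones above can be computed with a few lines of Pandas. This is only a sketch: `basic_metrics`, `top_senders`, the `me` argument (the account owner's display name), and the `conversation` and `sender_name` columns are all assumed names.

```python
import pandas as pd


def basic_metrics(df, me):
    """Return headline counts for a message DataFrame; `me` is the account owner's name."""
    sent = int((df["sender_name"] == me).sum())
    return {
        "total": len(df),
        "sent": sent,
        "received": len(df) - sent,
        "conversations": df["conversation"].nunique(),
    }


def top_senders(df, me, n=10):
    """Count received messages per contact, largest first."""
    received = df[df["sender_name"] != me]
    return received["sender_name"].value_counts().head(n)
```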

Most Messages Received From Contacts:

 

3. Diving a Little Deeper & Close Friends

After looking at the statistics from my basic metric analysis, I decided I wanted to analyze my messaging habits over the years more closely.

 

(1) Grouping Messages Sent/Received by Year
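One way to do this grouping, sketched under the assumption that the dataframe already has a `year` column and a `sender_name` column (both names, and the helper `messages_per_year`, are hypothetical):

```python
import pandas as pd


def messages_per_year(df, me):
    """Count sent and received messages for each calendar year."""
    # Label each row by direction relative to the account owner `me`
    direction = df["sender_name"].eq(me).map({True: "sent", False: "received"})
    counts = (
        df.assign(direction=direction)
          .groupby(["year", "direction"])
          .size()
          .unstack(fill_value=0)
    )
    # Guarantee both columns exist even if one direction has no messages
    return counts.reindex(columns=["sent", "received"], fill_value=0)
```

The result has one row per year and a `sent` and `received` column, which is a convenient shape for plotting.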

 

(2) Plotting the Results
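A plotting step like this might look roughly as follows with matplotlib. The choice of matplotlib, the helper name `plot_yearly_counts`, and the axis labels are assumptions for illustration; the input is any table indexed by year with one column per series.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_yearly_counts(counts, title):
    """Draw one line per column of `counts`, which is indexed by year."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for column in counts.columns:
        ax.plot(counts.index, counts[column], marker="o", label=column)
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of messages")
    ax.set_title(title)
    ax.legend()
    return fig
```

Calling `fig.savefig("total-msgs.png")` on the returned figure would write the chart to disk.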

 

Total Messages Sent and Received Over Time

[Figure: total-msgs]

 

Interesting! I was also curious to see who exactly I was messaging during these periods, so I decided to break down my data by person as well. Because I had over 2,500 conversations, most of which were under 100 messages, I filtered my conversations to include only "close friends", which I arbitrarily defined as anyone who sent me over 25,000 messages (which meant that we would have roughly 50,000 total messages together).
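The "close friends" filter can be sketched like this. The 25,000-message threshold comes from the text above; the helper names and the `sender_name` and `year` columns are assumptions.

```python
import pandas as pd


def close_friend_messages(df, me, min_received=25_000):
    """Keep received messages from senders with more than `min_received` messages."""
    received = df[df["sender_name"] != me]
    totals = received["sender_name"].value_counts()
    close = totals[totals > min_received].index
    return received[received["sender_name"].isin(close)]


def yearly_counts_by_person(received):
    """Rows are years, columns are people, values are messages received."""
    return received.groupby(["year", "sender_name"]).size().unstack(fill_value=0)
```

Lowering `min_received` is an easy way to try the same breakdown on a smaller message history.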

 

Again, the next step was to graph my results:

 

Messages Received Over Time, Broken Down By Person

[Figure: close-friends]

...And After Removing Outliers...

Code to remove any outliers:
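The exact outlier rule is not described here, so the following is only one plausible approach, not necessarily the one used in this project: treat each person's total message count as a data point and drop anyone above the common Q3 + 1.5 × IQR cutoff. The helper `drop_outlier_people` and the year-by-person table layout are assumptions.

```python
import pandas as pd


def drop_outlier_people(table, k=1.5):
    """Drop columns whose total exceeds Q3 + k * IQR of all column totals."""
    totals = table.sum(axis=0)  # one total per person (column)
    q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return table.loc[:, totals <= upper]
```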

[Figure: close-friends-2]

 

Language Usage Over Time

Finally, I wanted to analyze how the language I've used has changed over the years. I decided that the best way to do this was by using word clouds to visualize what words and phrases I used most frequently throughout the different years. (I also wrote code that told me what specific phrases I used the most, but as you can probably tell, I really like visual representations!) I did this by:

(1) Filtering the dataframe to only include messages that I sent in a certain year

 

(2) Importing stopwords, as well as adding some of my own

 

(3) Splitting each message into a list of words, and adding each word to an overall words list

 

(4) Generating a wordcloud using the word list
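Steps (1) through (3) can be sketched with Pandas and the standard library. The tiny `STOPWORDS` set, the helper `word_frequencies`, and the column names are illustrative assumptions; a real run would use a much fuller stopword list.

```python
import re
from collections import Counter

import pandas as pd

# A tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "to", "i", "you", "it", "is", "of"}


def word_frequencies(df, me, year, extra_stopwords=()):
    """Count the words in messages sent by `me` during `year`, minus stopwords."""
    mine = df[(df["sender_name"] == me) & (df["year"] == year)]
    stop = STOPWORDS | {w.lower() for w in extra_stopwords}
    words = []
    for message in mine["content"].dropna():
        # Lowercase, then keep runs of letters, digits, and apostrophes as words
        words.extend(w for w in re.findall(r"[a-z0-9']+", message.lower()) if w not in stop)
    return Counter(words)
```

For step (4), the resulting counts can be passed to the `wordcloud` package, for example `WordCloud().generate_from_frequencies(freqs)`. Note that this tokenizer turns an emoticon like ":D" into the word "d", which matches the kind of artifact described in the 2012 word cloud note.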

 

Here were the final results!

2012 Word Cloud

[Figure: wordcloud2012]

Note: The 'd' and '3' are most likely from 12-y/o me using ':D' and ':3' emoticons excessively

2016 Word Cloud

[Figure: wordcloud2016]

 

2020 Word Cloud

[Figure: wordcloud2020]

 

4. NLP & Sentiment Analysis

Happiness Trend over the Years

After seeing how my language has changed over the years, I thought it would be interesting to analyze its general sentiment. The messages I sent in a given period are probably a good indicator of how positive/happy I was feeling then, so this would be a cool way to see how my happiness levels have changed over the years.

After looking up several libraries online, I decided that the easiest way to do this was by using the library TextBlob.

 

(1) Analyzing Sentiment from a Dataframe
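To keep this sketch self-contained, it uses a toy word-list scorer as a stand-in for TextBlob, so the scores it produces are not comparable to TextBlob's. The word lists, `toy_polarity`, `yearly_sentiment`, and the column names are all hypothetical.

```python
import pandas as pd

# Stand-in word lists, for illustration only; they are not TextBlob's lexicon.
POSITIVE = {"happy", "love", "great", "yay"}
NEGATIVE = {"sad", "hate", "awful", "ugh"}


def toy_polarity(text):
    """Score text in [-1, 1] as (positive - negative) / matched words; 0 if none match."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def yearly_sentiment(df, me, polarity=toy_polarity):
    """Average the polarity of messages sent by `me`, grouped by year."""
    mine = df[df["sender_name"] == me].copy()
    mine["polarity"] = mine["content"].astype(str).map(polarity)
    return mine.groupby("year")["polarity"].mean()
```

With TextBlob installed, the real scorer can be swapped in by passing `polarity=lambda t: TextBlob(t).sentiment.polarity`, which also returns a value between -1 and 1.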

 

(2) Graphing the Dataframe

 

Here were the results!

Happiness Chart

[Figure: happiness]

No idea why I was so sadboi in 2014. That was the year I started high school, so maybe that's partly why?

5. Modelling & Relation to CIS545 Material

Yay! The last thing I did was create models for my data. I chose to analyze messages between myself and one of my close friends, AC.

Linear Regression to Predict How Many Messages I'll Receive From AC

Before writing the linear regression for messages between myself and AC, I decided to first analyze our basic messaging trends, using a method similar to the one in part 1.

Messaging Trends

[Figure: ac]

 

Next, I wrote a simple linear regressor using scikit-learn (sklearn) to predict the number of messages we would send each other in the next month.
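A simple regressor of this kind could look like the sketch below. It assumes the only feature is the month index (0, 1, 2, ...) and that the input is a list of monthly message counts in chronological order; the helper name `predict_next_month` is hypothetical and the project's actual features may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def predict_next_month(monthly_counts):
    """Fit count ~ month index and extrapolate one month past the data."""
    y = np.asarray(monthly_counts, dtype=float)
    X = np.arange(len(y)).reshape(-1, 1)  # month 0, 1, 2, ... as the only feature
    model = LinearRegression().fit(X, y)
    return float(model.predict([[len(y)]])[0])
```

A straight-line fit like this only captures a steady upward or downward trend, so it will miss seasonal patterns such as school breaks.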

Code

 

Resulting Linear Regression Charts

[Figure: lr-grac]

[Figure: lr-ac]

 

I then used a more complex linear regression model based on the one we learned in class.

 

Dimensionality Reduction using PCA

 

Machine Learning to Predict Who Sent What Text

Finally, I thought it would be cool to write a machine learning model to predict who sent what message.

(1) Labelling Data to Who Sent What Message

 

(2) Training our Sets

 

(3) Building a Linear Classifier off this Data using Logistic Regression

 

(4) Prediction Accuracy

I ended up with an accuracy of around 68%, which is not much better than guessing, but better than nothing.
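The four steps above can be sketched with scikit-learn as follows. Bag-of-words features via `CountVectorizer`, the 75/25 split, and the helper name `train_sender_classifier` are assumptions for illustration; the project's actual features and split may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_sender_classifier(texts, senders, test_size=0.25, seed=0):
    """Fit bag-of-words + logistic regression; return the model and held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, senders, test_size=test_size, random_state=seed, stratify=senders
    )
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Here `texts` is a list of message strings and `senders` is the matching list of labels (for example "Me" and "AC"). Comparing the returned accuracy against the share of messages from the more frequent sender gives a fairer baseline than a flat 50%.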

 

Conclusion

I've always wondered about my messaging habits over the years, so this has been a project I've been wanting to tackle for a long time. Overall, my findings were about what I expected.

The most challenging part of this project was debugging different syntax errors and figuring out how to do the regression analysis.

My favorite part of this project was seeing the visualizations for how often I messaged my closest friends over the years. I also thought that seeing the wordclouds for the language that I used over the years was interesting.