January 2020 - February 2020
In my large-scale database storage and retrieval class, we learned how Twitter started out on a relational database model and later switched to a document/key-value storage system. To get a deeper understanding, our project was to create a Twitter mockup consisting of 1 million tweets and 15,000 users.
Here is the link to the code. I randomly generated tweets and user follower/followee data and stored them in a CSV file. I then iterated through the data and inserted it into MySQL. My laptop was able to keep up with Twitter's current rate of 6,000 tweet inserts per second.
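As a rough illustration of the data generation step, here is a minimal sketch in Python. It is not the project's actual code: the function names, column names, and placeholder tweet text are all assumptions, and the sizes are scaled down so it runs instantly. It writes to an in-memory buffer where the real script would write to a file.

```python
import csv
import io
import random


def generate_follows(num_users, follows_per_user, rng):
    """Return (follower_id, followee_id) pairs; nobody follows themselves."""
    pairs = []
    for follower in range(1, num_users + 1):
        candidates = [u for u in range(1, num_users + 1) if u != follower]
        # rng.sample picks distinct followees, so there are no duplicate pairs
        for followee in rng.sample(candidates, follows_per_user):
            pairs.append((follower, followee))
    return pairs


def generate_tweets(num_tweets, num_users, rng):
    """Return (tweet_id, user_id, tweet_text) rows with placeholder text."""
    return [
        (tweet_id, rng.randint(1, num_users), f"tweet number {tweet_id}")
        for tweet_id in range(1, num_tweets + 1)
    ]


def write_csv(rows, header, out):
    """Write a header line followed by the rows to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


rng = random.Random(42)  # fixed seed so the data set is reproducible
follows = generate_follows(num_users=20, follows_per_user=3, rng=rng)
tweets = generate_tweets(num_tweets=100, num_users=20, rng=rng)

# Stand-in for open("tweets.csv", "w", newline="") in a real run
buffer = io.StringIO()
write_csv(tweets, ["tweet_id", "user_id", "tweet_text"], buffer)
```

Scaling this up to the project's sizes would mean passing 15,000 users and 1 million tweets, although picking followees this way is slow for large user counts and a real script would likely sample differently.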
My next step was a home timeline analysis that simulated a home screen refresh. To do this, I had to join the list of accounts a user follows to the tweets those accounts posted. This is where MySQL struggled: Twitter handles 500,000 home timeline updates per second, and I was only able to get about 5.5 refreshes per second.
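The join behind a home timeline refresh can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table and column names (tweets, follows, tweet_ts, and so on) are assumptions rather than the project's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (
        tweet_id   INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        tweet_ts   INTEGER NOT NULL,
        tweet_text TEXT NOT NULL
    );
    CREATE TABLE follows (
        follower_id INTEGER NOT NULL,
        followee_id INTEGER NOT NULL,
        PRIMARY KEY (follower_id, followee_id)
    );
""")

# User 1 follows users 2 and 3; user 2 follows user 3.
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3), (2, 3)])
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    (1, 2, 100, "from user 2"),
    (2, 3, 101, "from user 3"),
    (3, 4, 102, "from user 4, whom user 1 does not follow"),
    (4, 1, 103, "user 1's own tweet"),
])


def home_timeline(conn, user_id, limit=10):
    """Newest tweets posted by the accounts that user_id follows."""
    rows = conn.execute("""
        SELECT t.tweet_id, t.user_id, t.tweet_text
        FROM follows AS f
        JOIN tweets AS t ON t.user_id = f.followee_id
        WHERE f.follower_id = ?
        ORDER BY t.tweet_ts DESC
        LIMIT ?
    """, (user_id, limit))
    return rows.fetchall()


timeline = home_timeline(conn, user_id=1)
```

Every refresh repeats this join and sort over the tweets table, which is the work a relational database has to do on each read.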
We then took what we learned and applied it to Redis. We had to come up with a way to store the user follower/followee relationships and the tweet information in Redis. We implemented Twitter in Redis using two strategies:
1) Just store tweets and user follower/followee relationships
2) Store the above data and also maintain a home timeline database, where each new tweet posts a reference to itself to the home timeline of every follower of the user who posted it. This allows home refreshes to be pregenerated, and it is the approach Twitter takes in real life.
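To make the difference between the two strategies concrete, here is a minimal sketch that uses plain Python dictionaries in place of Redis so that it runs without a server. The class and method names are hypothetical and this is not the project's code. In a real Redis version, the follower sets would likely be Redis sets and the per-user timelines Redis lists, but that mapping is an assumption.

```python
from collections import defaultdict


class TimelineStore:
    """Toy in-memory model of the two strategies (not real Redis)."""

    def __init__(self, fan_out_on_write):
        self.fan_out_on_write = fan_out_on_write  # True selects strategy 2
        self.tweets = {}                          # tweet_id -> (user_id, text)
        self.tweets_by_user = defaultdict(list)   # user_id -> [tweet_id, ...]
        self.followers = defaultdict(set)         # user_id -> users following them
        self.following = defaultdict(set)         # user_id -> users they follow
        self.timelines = defaultdict(list)        # strategy 2: pregenerated timelines
        self.next_id = 1

    def follow(self, follower, followee):
        self.followers[followee].add(follower)
        self.following[follower].add(followee)

    def post(self, user_id, text):
        tweet_id = self.next_id
        self.next_id += 1
        self.tweets[tweet_id] = (user_id, text)
        self.tweets_by_user[user_id].append(tweet_id)
        if self.fan_out_on_write:
            # Strategy 2: extra work per tweet, one write per follower
            for follower in self.followers[user_id]:
                self.timelines[follower].append(tweet_id)
        return tweet_id

    def home_timeline(self, user_id, limit=10):
        if self.fan_out_on_write:
            # Strategy 2: the timeline already exists, just read it
            ids = self.timelines[user_id]
        else:
            # Strategy 1: gather tweets from every followed account at read time
            ids = [tid
                   for followee in self.following[user_id]
                   for tid in self.tweets_by_user[followee]]
        # Tweet ids increase over time, so sorting by id gives newest first
        newest_first = sorted(ids, reverse=True)[:limit]
        return [self.tweets[tid] for tid in newest_first]


def demo(fan_out_on_write):
    store = TimelineStore(fan_out_on_write)
    store.follow(follower=1, followee=2)
    store.follow(follower=1, followee=3)
    store.post(2, "hello from 2")
    store.post(3, "hello from 3")
    store.post(4, "not followed by 1")
    return store.home_timeline(1)
```

Both settings return the same timeline for user 1. The difference is where the work happens: strategy 1 pays on every refresh, while strategy 2 pays once per follower when the tweet is posted.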
We used a different laptop to test Redis, but we did find that Redis did a better job than MySQL at storing tweets. Redis took a long time to pull up the home timeline with strategy 1 but performed extremely well with strategy 2. Because strategy 2 inserts a tweet reference into the home timeline of every follower of the poster, it took longer for our program to store tweets. This was a trade-off we had to make.