January 2020 – April 2020

For my Software Development class at Northeastern University, we developed a distributed database in C++. We started by building basic data structures to get comfortable with the language, including arrays, linked lists, and maps. We then moved on to developing Dataframes in C++. Much like Python's DataFrames, ours were tables where each column held a single type: int, float, bool, or string, where the string type was a class we built ourselves on top of raw chars. Throughout this part of the project, we made many decisions about how to allocate memory and how to use pointers. Over time, we added features to our Dataframes, like the ability to add new rows and columns and to run functions that map over the whole Dataframe (for example, adding two to every int).
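To give a sense of the shape of this, here is a minimal sketch of the idea, not our actual class (the real one had several typed column classes and our own string type), but it illustrates typed columns and a map-style function over every value:

```cpp
#include <cstdio>
#include <vector>

// Minimal sketch of the Dataframe idea (hypothetical names, not the real class):
// each column holds a single type, and a map-style function visits every value.
class IntColumn {
public:
    std::vector<int> vals;
    void push_back(int v) { vals.push_back(v); }
};

class SimpleDataframe {
public:
    std::vector<IntColumn> cols;

    void add_column(const IntColumn& c) { cols.push_back(c); }

    // Apply a function to every int in the frame, e.g. "add two to every int".
    template <typename Fn>
    void map(Fn fn) {
        for (auto& col : cols)
            for (auto& v : col.vals)
                v = fn(v);
    }
};

int main() {
    IntColumn c;
    c.push_back(1);
    c.push_back(2);

    SimpleDataframe df;
    df.add_column(c);
    df.map([](int v) { return v + 2; });  // every int becomes v + 2

    std::printf("%d %d\n", df.cols[0].vals[0], df.cols[0].vals[1]);  // prints: 3 4
    return 0;
}
```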

Our next objective was to serialize class data so that it could be sent over the wire in char format, then deserialized and reassembled into a class on the other end. To do this, my partner and I designed our own scheme for how data would be encoded and how each piece would be delimited. We then built a network layer where clients connect to a server, and the server tells each client every other client's IP and port. Using this information, the clients connect directly to one another, giving us a distributed network where no central server is needed from that point on.

Our final project consisted of multiple abstracted classes, each serving a specific purpose (this is where object-oriented programming comes in). From the top down, our professor would give us Applications that use our Dataframe class to create Dataframes of whatever data they want. We were responsible for breaking that data down, serializing it, and sending it over our Network class to the other clients so that each client stores a fraction of the data in its KVStore class. The Application layer, from any client, could then either run localized functions on each client and combine the results to get a final answer (for example, each client sums the words in its own portion of the data and returns the count to node 0, which adds up all the partial sums to get the total word count), or it could ask for the Dataframe back, in which case that client's KVStore handles the networking to request the missing pieces from the correct clients, rebuilds the Dataframe, and hands it to the Application layer.

This was a very complicated project, and over the four months we learned a lot. At each step, we had to decide how to get a specific part of the project working. For example, to serialize data we chose to use "}" to separate one piece of data from the next, and we sent metadata in a specific order so the receiving client could parse the payload and rebuild the data. We found that each decision we made affected the decisions we made in other layers of our code. We also learned that this layered approach is very helpful when upper-level layers should not have to care how lower-level layers work. For example, the Application layer just tells the KVStore layer which Dataframe to store and later asks for it back, without caring how the KVStore stores the data. In the same way, the KVStore simply breaks up a given Dataframe and tells the Networking layer which piece of data to send to which client node; it doesn't have to care how the networking is set up with IPs and ports, or how the data is serialized and deserialized. You can find the git repository for my code here.
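As a flavor of the delimiter-based serialization described above, here is a minimal sketch with hypothetical names; it is not our actual encoding (which carried more metadata and handled Dataframe chunks), but it shows the idea of joining fields with a "}" separator and splitting them back apart on the other end:

```cpp
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical example record; the real project serialized Dataframe chunks.
struct Record {
    int id;
    std::string name;
    float score;
};

// Join fields into one char buffer, separated by '}' (the delimiter we chose).
std::string serialize(const Record& r) {
    std::ostringstream out;
    out << r.id << "}" << r.name << "}" << r.score;
    return out.str();
}

// Split the buffer back apart on the receiving side and rebuild the record.
Record deserialize(const std::string& wire) {
    std::vector<std::string> parts;
    std::stringstream ss(wire);
    std::string piece;
    while (std::getline(ss, piece, '}')) parts.push_back(piece);

    Record r;
    r.id = std::stoi(parts[0]);
    r.name = parts[1];
    r.score = std::stof(parts[2]);
    return r;
}

int main() {
    Record original{7, "words.txt", 0.5f};
    std::string wire = serialize(original);   // "7}words.txt}0.5"
    Record rebuilt = deserialize(wire);       // reassembled on the other end
    std::printf("%d %s %.1f\n", rebuilt.id, rebuilt.name.c_str(), rebuilt.score);
    return 0;
}
```

A format like this assumes the delimiter never appears inside the data itself, which is exactly the kind of constraint that ended up shaping decisions in the layers above and below.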
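To show how the layering keeps the Application from caring about storage or networking, here is a rough sketch of the kind of interfaces involved. The names and signatures are illustrative assumptions rather than our exact code, and the store here is purely local; the real KVStore split the Dataframe across client nodes and fetched remote chunks inside get.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the typed-column Dataframe described earlier.
struct Dataframe {
    std::vector<std::string> words;
};

// The KVStore interface hides how and where data is stored. In the real project,
// put/get also handled serialization and networking to the other client nodes.
class KVStore {
public:
    virtual ~KVStore() = default;
    virtual void put(const std::string& key, Dataframe df) = 0;
    virtual Dataframe get(const std::string& key) = 0;
};

// A purely local stand-in; a distributed version would scatter chunks across
// nodes on put() and gather them back over the network on get().
class LocalKVStore : public KVStore {
public:
    void put(const std::string& key, Dataframe df) override { store_[key] = std::move(df); }
    Dataframe get(const std::string& key) override { return store_[key]; }
private:
    std::map<std::string, Dataframe> store_;
};

// The Application layer only talks to the KVStore; it never touches sockets,
// IPs, ports, or the serialization format.
class WordCountApp {
public:
    explicit WordCountApp(KVStore* store) : store_(store) {}
    void run() {
        store_->put("chunk-0", Dataframe{{"the", "quick", "brown", "fox"}});
        Dataframe df = store_->get("chunk-0");
        std::printf("word count: %zu\n", df.words.size());  // prints: word count: 4
    }
private:
    KVStore* store_;
};

int main() {
    LocalKVStore store;
    WordCountApp app(&store);
    app.run();
    return 0;
}
```

Because the Application only sees put and get, swapping the local store for the distributed one doesn't change the Application code at all, which is the payoff of the layered design described above.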