Reflections on My Last Semester at OMSCS

Introduction

Spring 2021 was my last semester in the master's program at Georgia Tech. I chose to push my limits and took two courses:

  • 6515 Graduate Introduction to Algorithms
  • 7210 Distributed Computing

The second course is new to OMSCS. Nobody had taken it before; we were literally the first batch of students to test the course.

It turned out that the course is excellent, and its content matches my passion for distributed systems very well.

Course Contents

Let’s first see what is included in the syllabus:

  • Introduction to Distributed Systems
  • Primer of RPC
  • Time in Distributed Systems
  • State in Distributed Systems
  • Consensus
  • PAXOS and Friends
  • Replication
  • Fault-tolerance
  • Distributed Transactions
  • Consistency and Geo-Distributed Data Stores
  • Peer-to-peer, Mobility
  • Distributed Data Analytics
  • Distributed Machine Learning
  • Support for Datacenter-based Distributed Computing
  • Byzantine Fault Tolerance, Blockchain
  • Edge Computing, IoT

The course covers a wide range of theory and real-life practice in distributed systems. It starts from the fundamentals and goes all the way up to recent developments in the industry, including graduate-level topics such as consensus, replication and fault tolerance.

The suggested readings include classic papers as well as industry case studies from the companies running the largest distributed (and geo-distributed) systems in the world. Some examples:

  • Raynal and M. Singhal. Logical Time: A Way to Capture Causality in Distributed Systems. IRISA Technical Report. (up to Section 7)
  • Lamport. Time, Clocks, and the Ordering of Events in a Distributed System
  • Spanner: Google’s Globally-Distributed Database
  • Scaling Memcache at Facebook
  • Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

One of the hottest topics, machine learning, is also covered in this course from a distributed systems angle. You will learn advanced techniques for keeping a learning system running at maximum efficiency in a geo-distributed scenario, using compute power from around the globe.

Projects

Well, I would say this course has the most hardcore projects of all the courses I took in the program. They are even much harder than those in 8803 Compiler. You will be asked to implement a number of things that are critical components or concepts of a distributed system:

  • Key-value stores
  • PAXOS consensus
  • View server
  • State transition and consistency in distributed systems
  • Two-phase commit
  • Distributed transactions
  • Async message delivery in distributed systems

The projects are related, with one serving as the base for the next, especially Projects 4 and 5.

My only complaint is about the instructions, which could state the intent and requirements of each project much more clearly. Most of the time I struggled to understand what should be done and what the expected way of doing it was.

A suite of tests runs on Gradescope to evaluate your implementation. Expect two types: RUN tests and SEARCH tests. Search tests were a relatively new concept to me. Instead of checking whether specific inputs produce valid outputs, the test code walks through the STATEs that your program can reach and checks whether any of them violates an invariant. The concept reminds me of something I learned in CS6340 Software Analysis and Test, which checks invariants over code blocks, loops and branches.

Search tests can be very difficult to pass reliably unless the logic of the implementation is perfect or close to it. Any edge case, including seemingly trivial ones, will be caught once the search walks into a state that violates an invariant. Compared to run tests, search tests are very efficient at finding missed edge cases, helped by the DSLab framework, which runs everything in memory (including I/O operations).
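To make the idea of a search test concrete, here is a minimal Python sketch of a breadth-first walk over reachable states that checks an invariant at every step. It is not the DSLab framework's actual API; the names search, make_next_states and at_most_one_holder, as well as the toy lock server they model, are made up for illustration.

```python
from collections import deque


def search(initial, next_states, invariant, max_states=10_000):
    """Breadth-first walk over reachable states.

    Returns the path (list of states) to the first state that violates
    the invariant, or None if no violation is found within the budget.
    """
    parent = {initial: None}          # state -> state we reached it from
    queue = deque([initial])
    while queue and len(parent) <= max_states:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:  # walk back to the initial state
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in next_states(state):
            if nxt not in parent:     # skip states we have already seen
                parent[nxt] = state
                queue.append(nxt)
    return None


# Hypothetical toy system: a lock server. A state is a pair
# (holders, pending), where pending is the set of undelivered messages.
def make_next_states(buggy):
    def next_states(state):
        holders, pending = state
        for msg in pending:           # any pending message may arrive next
            _kind, client = msg
            rest = pending - {msg}
            if buggy or not holders:  # the buggy server never checks holders
                yield (holders | {client}, rest)
            else:
                yield (holders, rest)  # request rejected, lock already held
    return next_states


def at_most_one_holder(state):
    holders, _pending = state
    return len(holders) <= 1


initial = (frozenset(), frozenset({("acquire", "A"), ("acquire", "B")}))
buggy_trace = search(initial, make_next_states(buggy=True), at_most_one_holder)
fixed_trace = search(initial, make_next_states(buggy=False), at_most_one_holder)
print("buggy server violation trace:", buggy_trace)
print("fixed server violation trace:", fixed_trace)
```

The point of the sketch is that nothing asserts on a particular output. The search simply tries every message delivery order, so the buggy server gets caught the moment both clients hold the lock, while the fixed server explores the same states without a violation.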

The difficulty of the projects increases tremendously after Project 3 (as of Spring 2021). The first two are quite straightforward if you have some exposure to software engineering. You implement a basic key-value store by extending the provided interface and the base Application class. After that, PAXOS comes into play, and you will be scratching your head (down to the bone) for over 50 hours to implement the famous consensus protocol from Lamport’s paper Paxos Made Simple. It feels like the “How to draw an owl” meme, where the steps jump from two circles straight to the finished owl.

For sure, you will have learned A LOT by the time you finish all five projects. It’s okay if you cannot get a full score on all of them; I did not on the last one. Still, learning by coding is a good approach to studying distributed systems, and computer science in general.

Conclusion

Having worked my way through the course, I firmly believe that my passion lies in Distributed Systems, and I want to continue this journey. Go deep and go wide: that’s what I will be trying to do.

I strongly recommend this class to anyone who is interested in these topics. It is well organized and rich in content, with good support from the professor and TAs. The topics are not stale, and you could even apply most of what you learn in your job if you work in the cloud industry.

I have also applied to be a TA for this course. See you in Fall 2021.
