<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Oxyprogrammer's blog]]></title><description><![CDATA[Shaping robust architectures for innovative solutions that empower businesses to thrive.]]></description><link>https://oxyprogrammer.com</link><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 13:31:25 GMT</lastBuildDate><atom:link href="https://oxyprogrammer.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Practical Approach to learn memory optimization using goroutines]]></title><description><![CDATA[Introduction
After writing Tenets of Multithreading in Go: Detailed Tutorial, I felt compelled to share a problem I had to tackle during my work as a software engineer. Real-life software engineering often involves solving unique challenges, each wit...]]></description><link>https://oxyprogrammer.com/practical-approach-to-learn-memory-optimization-using-goroutines</link><guid isPermaLink="true">https://oxyprogrammer.com/practical-approach-to-learn-memory-optimization-using-goroutines</guid><category><![CDATA[green threading]]></category><category><![CDATA[thread optimization]]></category><category><![CDATA[golang]]></category><category><![CDATA[multithreading]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Fri, 16 May 2025 11:09:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747393747973/2d1f5c13-0f0a-48cc-afc7-53850d67f86f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>After writing <a target="_blank" href="https://oxyprogrammer.com/tenets-of-multithreading-in-go-detailed-tutorial">Tenets of Multithreading in Go: Detailed Tutorial</a>, I felt compelled to share a problem I had to tackle during my work as a software engineer. Real-life software engineering often involves solving unique challenges, each with its own narrative.</p>
<p>In this article, we will analyze four different approaches to a specific problem step-by-step. We will evaluate each solution, weighing the pros and cons, to demonstrate that there is no “one-size-fits-all” approach, as the best solution depends on the circumstances at hand.</p>
<p>By the end, we will compare all four approaches and make an informed choice based on our case.</p>
<h1 id="heading-the-problem">The problem</h1>
<p>The problem itself is not overly complicated. We had an NFS file location where numerous CSV files were being generated by an upstream process over which we had no control.</p>
<p>The task was to set up a cron job that would execute at regular intervals to check for newly created files. If new files were found, the job would upload their contents into a database and subsequently move the files to an archive location. A new instance of the cron job would be launched for each file detected. The requirement was to create a Go program to perform the cron job tasks.</p>
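<p>For concreteness, a cron schedule for such a job might look like the entry below. The five-minute interval, binary path, and log path are illustrative placeholders, not details from the original setup:</p>
<pre><code>*/5 * * * * /opt/uploader/go-file-uploader >> /var/log/uploader.log 2>&amp;1
</code></pre>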
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf847UwqnbGbMyW-iTw4bEqCPll04POyUttSbqBNksIP-isezJKpXN-ZzfS-sLvAGuHiTCc3fsFdw6W-xgw5Thn2E_U7AXxC-S0f6OnfUU3_elo4IVgL-TVvFpdzAAM5y6TSS8DKg?key=JbTeKPAQpA2r0dujoXXrO6oU" alt class="image--center mx-auto" /></p>
<h1 id="heading-code-walkthrough">Code Walkthrough</h1>
<p>To test the various approaches, I created an API application in Go. <a target="_blank" href="https://github.com/OxyProgrammer/go-file-uploader">Here is the repository</a>. When you run the application for the first time, it creates a sample CSV file with 10 million records (approximately 0.5 GB).</p>
<p>The API provides four endpoints:</p>
<ul>
<li><p><code>GET /solution-one</code></p>
</li>
<li><p><code>GET /solution-two</code></p>
</li>
<li><p><code>GET /solution-three</code></p>
</li>
<li><p><code>GET /solution-four</code></p>
</li>
</ul>
<p>All of these endpoints perform the same tasks:</p>
<ol>
<li><p>Read all lines from the file.</p>
</li>
<li><p>Convert each line into a read entity.</p>
</li>
<li><p>Transform the read entities into write entities.</p>
</li>
<li><p>Insert the write entities into the database.</p>
</li>
</ol>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeVrQ9UE57OY5KlOyig3WTpNRMJLeILxfZNGKQoL6bbvBLmnIBnA-3K7TtFGE_d8Bb3Pkobylt_ccWNBEO7wG1BV7Za7JXUZwNJMNrIMPdMD-bACIpQMo0KitEcm0WLCZ6x69gu?key=JbTeKPAQpA2r0dujoXXrO6oU" alt class="image--center mx-auto" /></p>
<p>The application uses GORM as its ORM for convenient bulk inserts, configured to insert in batches of 1,000. Each endpoint returns a response that includes:</p>
<ul>
<li><p>Time taken for the solution.</p>
</li>
<li><p>Memory consumed for the operation.</p>
</li>
</ul>
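<p>Under the hood, GORM's <code>CreateInBatches</code> simply chunks the slice and issues one INSERT per chunk. The generic helper below is an illustrative, framework-free sketch of that idea; it is not code from the repository:</p>

```go
package main

import "fmt"

// insertInBatches mimics what an ORM-level batch insert does: it splits the
// slice into chunks of batchSize and hands each chunk to the insert function.
func insertInBatches[T any](items []T, batchSize int, insert func([]T) error) error {
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}
		if err := insert(items[start:end]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	items := make([]int, 2500)
	var sizes []int
	_ = insertInBatches(items, 1000, func(chunk []int) error {
		sizes = append(sizes, len(chunk))
		return nil
	})
	fmt.Println(sizes) // chunk sizes for 2,500 items at batch size 1,000
}
```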
<p>The database used in the application is SQLite, which permits only a single writer at a time, creating a bottleneck during write operations. For databases that support concurrent writes, such as PostgreSQL, the readings gathered from the runs of the different solutions may differ, especially for the multithreaded approaches.</p>
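<p>The <code>memoryUsage</code> and <code>elapsed</code> figures in the responses below can be captured along these lines with Go's <code>runtime</code> package. This is a hedged sketch, not necessarily how the repository measures them:</p>

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// measure runs fn and reports the bytes allocated during the call and the
// wall-clock time it took. TotalAlloc is cumulative, so the delta only grows.
func measure(fn func()) (allocatedBytes uint64, elapsed time.Duration) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	start := time.Now()
	fn()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc, time.Since(start)
}

func main() {
	var buf []byte
	alloc, took := measure(func() { buf = make([]byte, 10<<20) }) // allocate ~10 MB
	fmt.Printf("memoryUsage=%d elapsedNs=%d len=%d\n", alloc, took.Nanoseconds(), len(buf))
}
```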
<h1 id="heading-solution-1-brute-force">Solution 1: Brute force</h1>
<p>The brute force method involves the following steps:</p>
<ol>
<li><p>Read all lines from the file into memory.</p>
</li>
<li><p>Convert read entities and hold them in memory.</p>
</li>
<li><p>Transform all read entities into write entities.</p>
</li>
<li><p>Insert all write entities into the database.</p>
</li>
</ol>
<p>Here’s the straightforward code:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">LoadAllAndInsert</span><span class="hljs-params">(database *db.DB)</span> <span class="hljs-title">error</span></span> {
    landReadModels, err := utils.ReadCSVAll(<span class="hljs-string">"data/land_feed.csv"</span>)
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(err) &gt; <span class="hljs-number">0</span> {
        <span class="hljs-comment">// log.Fatal would terminate the process before the return runs; log and return instead.</span>
        log.Println(err)
        <span class="hljs-keyword">return</span> errors.New(<span class="hljs-string">"Some error happened. Check logs."</span>)
    }


    <span class="hljs-keyword">var</span> dbLandModels []*models.Land


    <span class="hljs-keyword">for</span> _, landReadModel := <span class="hljs-keyword">range</span> landReadModels {
        dbLandModels = <span class="hljs-built_in">append</span>(dbLandModels, models.FromReadModel(landReadModel))
    }


    <span class="hljs-keyword">return</span> database.CreateLands(dbLandModels)
}
</code></pre>
<p>The response received is as follows:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"error"</span>: <span class="hljs-string">""</span>,
    <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Successfully completed request for Load All And Insert In Batches."</span>,
    <span class="hljs-attr">"memoryUsage"</span>: <span class="hljs-number">3296896352</span>,
    <span class="hljs-attr">"elapsed"</span>: <span class="hljs-number">18968041000</span>
}
</code></pre>
<h1 id="heading-solution-2-buffered-inserts-along-with-reading">Solution 2: Buffered inserts while reading</h1>
<p>This method utilizes buffered inserts while reading:</p>
<ol>
<li><p>Read a line from the file and convert it to a read entity.</p>
</li>
<li><p>Convert the read entity into a write entity.</p>
</li>
<li><p>Push the write entity into a buffer.</p>
</li>
<li><p>If the buffer is full (10,000 entities), perform a bulk insert into the database; otherwise, continue populating the buffer.</p>
</li>
</ol>
<p>Here’s the code:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ReadLineAndAndInsertInBatches</span><span class="hljs-params">(database *db.DB)</span> <span class="hljs-title">error</span></span> {
    file, err := os.Open(<span class="hljs-string">"data/land_feed.csv"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> file.Close()
    <span class="hljs-comment">// Create a CSV reader</span>
    reader := csv.NewReader(file)

    <span class="hljs-comment">//Read the headers</span>
    _, _ = reader.Read()
    lineNumber := <span class="hljs-number">1</span>
    <span class="hljs-keyword">var</span> buffer []*models.Land

    <span class="hljs-keyword">for</span> {
        record, err := reader.Read()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                <span class="hljs-keyword">break</span>
            }
            <span class="hljs-keyword">return</span> err
        }
        readEntity, err := utils.CreateEntityFromRecord(record, lineNumber)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> err
        }
        dbEntity := models.FromReadModel(*readEntity)
        buffer = <span class="hljs-built_in">append</span>(buffer, dbEntity)
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(buffer) == <span class="hljs-number">10000</span> {
            <span class="hljs-keyword">if</span> err := database.CreateLands(buffer); err != <span class="hljs-literal">nil</span> {
                <span class="hljs-keyword">return</span> err
            }
            buffer = buffer[:<span class="hljs-number">0</span>]
        }
        lineNumber++
    }
    <span class="hljs-comment">// Flush any remaining entities in the buffer</span>
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(buffer) &gt; <span class="hljs-number">0</span> {
        <span class="hljs-keyword">return</span> database.CreateLands(buffer)
    }
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
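<p>One detail worth calling out: <code>buffer = buffer[:0]</code> resets the slice's length to zero while keeping its backing array, so refilling it up to the old capacity performs no new allocation. This is safe here because <code>CreateLands</code> is synchronous and does not retain the slice. A standalone illustration:</p>

```go
package main

import "fmt"

func main() {
	buffer := make([]int, 0, 4)
	buffer = append(buffer, 1, 2, 3, 4)
	fmt.Println(len(buffer), cap(buffer)) // 4 4

	// Reset the length but keep the 4-element backing array for reuse.
	buffer = buffer[:0]
	fmt.Println(len(buffer), cap(buffer)) // 0 4

	// The next append reuses the existing array instead of allocating.
	buffer = append(buffer, 42)
	fmt.Println(len(buffer), cap(buffer)) // 1 4
}
```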
<p>The response received is as follows:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"error"</span>: <span class="hljs-string">""</span>,
    <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Successfully completed request for Read Line And Insert In Batches."</span>,
    <span class="hljs-attr">"memoryUsage"</span>: <span class="hljs-number">3851704</span>,
    <span class="hljs-attr">"elapsed"</span>: <span class="hljs-number">26142007900</span>
}
</code></pre>
<h1 id="heading-solution-3-using-goroutines-to-optimize-worker-pool">Solution 3: Optimizing with a goroutine worker pool</h1>
<p>To further optimize the process, I implemented goroutines with a worker pool approach. This method employs two channels: <code>readChannel</code> for read entities and <code>doneChannel</code> for signaling completion. The process consists of the following:</p>
<ol>
<li><p>A Producer goroutine reads each line from the CSV file, converts it into a read entity, and sends it to the <code>readChannel</code>.</p>
</li>
<li><p>Five Consumer goroutines listen to the <code>readChannel</code>, convert each read entity into a write entity, and maintain a buffer of 10,000 write items to be written to the database. When the buffer is full, it is written to the database, and the buffer is cleared.</p>
</li>
<li><p>Once all worker consumers signal completion, the <code>doneChannel</code> is triggered, allowing the parent method to exit.</p>
</li>
</ol>
<p>Here’s the implementation:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">MultiprocessingForReadingAndWriting</span><span class="hljs-params">(database *db.DB)</span> <span class="hljs-title">error</span></span> {

    readCh := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *models.LandRead, <span class="hljs-number">1000</span>)
    doneCh := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">struct</span>{})
    errCh := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> error)
    <span class="hljs-keyword">const</span> numWriters = <span class="hljs-number">5</span> <span class="hljs-comment">// Number of writing goroutines</span>
    <span class="hljs-keyword">go</span> readAndProduceModelAsync(readCh, errCh)

    <span class="hljs-keyword">var</span> wg sync.WaitGroup
    <span class="hljs-keyword">var</span> mutex sync.Mutex

    <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; numWriters; i++ {
        wg.Add(<span class="hljs-number">1</span>)
        <span class="hljs-keyword">go</span> writeAndConsumeAsync(database, readCh, errCh, &amp;wg, &amp;mutex)
    }

    <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        wg.Wait()
        <span class="hljs-built_in">close</span>(doneCh)
    }()

    <span class="hljs-comment">// Every case returns, so a bare select suffices; no loop is needed.</span>
    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> err := &lt;-errCh:
        <span class="hljs-keyword">return</span> err
    <span class="hljs-keyword">case</span> &lt;-doneCh:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">readAndProduceModelAsync</span><span class="hljs-params">(readCh <span class="hljs-keyword">chan</span>&lt;- *models.LandRead, errCh <span class="hljs-keyword">chan</span>&lt;- error)</span></span> {

    <span class="hljs-keyword">defer</span> <span class="hljs-built_in">close</span>(readCh)

    file, err := os.Open(<span class="hljs-string">"data/land_feed.csv"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-comment">// Send the error to the channel instead of log.Fatal, which would kill the whole process.</span>
        errCh &lt;- err
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-keyword">defer</span> file.Close()
    <span class="hljs-comment">// Create a CSV reader</span>
    reader := csv.NewReader(file)
    <span class="hljs-comment">//Read the headers</span>
    _, _ = reader.Read()

    lineNumber := <span class="hljs-number">1</span>

    <span class="hljs-keyword">for</span> {
        record, err := reader.Read()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                <span class="hljs-keyword">return</span>
            }
            errCh &lt;- err
            <span class="hljs-keyword">return</span>
        }

        readEntity, err := utils.CreateEntityFromRecord(record, lineNumber)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            errCh &lt;- err
            <span class="hljs-keyword">return</span>
        }

        readCh &lt;- readEntity
        lineNumber++
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">writeAndConsumeAsync</span><span class="hljs-params">(database *db.DB, readCh &lt;-<span class="hljs-keyword">chan</span> *models.LandRead, errCh <span class="hljs-keyword">chan</span>&lt;- error, wg *sync.WaitGroup, mutex *sync.Mutex)</span></span> {
    <span class="hljs-keyword">defer</span> wg.Done()

    <span class="hljs-keyword">var</span> buffer []*models.Land
    batchSize := <span class="hljs-number">10000</span>

    <span class="hljs-keyword">for</span> readEntity := <span class="hljs-keyword">range</span> readCh {
        dbEntity := models.FromReadModel(*readEntity)
        buffer = <span class="hljs-built_in">append</span>(buffer, dbEntity)


        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(buffer) == batchSize {
            mutex.Lock()
            err := database.CreateLands(buffer)
            mutex.Unlock()
            <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
                errCh &lt;- err
                <span class="hljs-keyword">return</span>
            }
            buffer = buffer[:<span class="hljs-number">0</span>]
        }
    }
    <span class="hljs-comment">// Flush any remaining entities in the buffer</span>
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(buffer) &gt; <span class="hljs-number">0</span> {
        mutex.Lock()
        err := database.CreateLands(buffer)
        mutex.Unlock()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            errCh &lt;- err
            <span class="hljs-keyword">return</span>
        }
    }
}
</code></pre>
<p>The response received is as follows:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"error"</span>: <span class="hljs-string">""</span>,
    <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Successfully completed request for Multiprocessing For Reading And Writing."</span>,
    <span class="hljs-attr">"memoryUsage"</span>: <span class="hljs-number">5289472</span>,
    <span class="hljs-attr">"elapsed"</span>: <span class="hljs-number">24760674200</span>
}
</code></pre>
<h1 id="heading-solution-4-further-squeezing-with-worker-pool-pipeline">Solution 4: Squeezing further with a worker-pool pipeline</h1>
<p>In this final solution, we enhance the previous design by introducing an additional type of routine to streamline the processing pipeline. The parent method establishes three channels:</p>
<ul>
<li><p><code>recordChannel</code> for raw string records from the file.</p>
</li>
<li><p><code>transformChannel</code> for converted write entities.</p>
</li>
<li><p><code>doneChannel</code> for signaling the completion of the operation.</p>
</li>
</ul>
<p>The approach consists of:</p>
<ol>
<li><p>A Producer routine that reads lines from the CSV file and sends the split strings to the <code>recordChannel</code>.</p>
</li>
<li><p>Five Transforming routines that listen to <code>recordChannel</code>, convert raw records into read entities, and publish the transformed write entities to the <code>transformChannel</code>.</p>
</li>
<li><p>Five Writing routines that listen to the <code>transformChannel</code>, maintain a buffer of 10,000 write entities, and write them to the database when the buffer is filled.</p>
</li>
<li><p>The <code>doneChannel</code> is signaled once all writing routines are completed, allowing the parent method to exit.</p>
</li>
</ol>
<p>Here’s the implementation:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> Record <span class="hljs-keyword">struct</span> {
    LineNo <span class="hljs-keyword">int</span>
    Data   []<span class="hljs-keyword">string</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">MultiProcessingForReadingTransformAndWriting</span><span class="hljs-params">(database *db.DB)</span> <span class="hljs-title">error</span></span> {
    recordCh := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *Record, <span class="hljs-number">1000</span>)
    transformCh := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *models.Land, <span class="hljs-number">1000</span>)
    doneCh := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">struct</span>{})
    errCh := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> error)
    <span class="hljs-keyword">const</span> numTransformers = <span class="hljs-number">5</span> <span class="hljs-comment">// Number of transformation goroutines</span>
    <span class="hljs-keyword">const</span> numWriters = <span class="hljs-number">5</span>      <span class="hljs-comment">// Number of writing goroutines</span>

    <span class="hljs-keyword">go</span> readAndProduceRecords(recordCh, errCh)

    <span class="hljs-keyword">var</span> wg sync.WaitGroup
    <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; numTransformers; i++ {
        wg.Add(<span class="hljs-number">1</span>)
        <span class="hljs-keyword">go</span> transformAndProduceDbModel(recordCh, transformCh, errCh, &amp;wg)
    }

    <span class="hljs-keyword">var</span> writeWg sync.WaitGroup
    <span class="hljs-keyword">var</span> mutex sync.Mutex
    <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; numWriters; i++ {
        writeWg.Add(<span class="hljs-number">1</span>)
        <span class="hljs-keyword">go</span> writeAndConsumeDbModel(database, transformCh, errCh, &amp;writeWg, &amp;mutex)
    }

    <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        wg.Wait()
        <span class="hljs-built_in">close</span>(transformCh) <span class="hljs-comment">// Close transformCh once all transform goroutines finish</span>
    }()

    <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        writeWg.Wait()
        <span class="hljs-built_in">close</span>(doneCh)
    }()

    <span class="hljs-comment">// Every case returns, so a bare select suffices; no loop is needed.</span>
    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> err := &lt;-errCh:
        <span class="hljs-keyword">return</span> err
    <span class="hljs-keyword">case</span> &lt;-doneCh:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">readAndProduceRecords</span><span class="hljs-params">(recordCh <span class="hljs-keyword">chan</span>&lt;- *Record, errCh <span class="hljs-keyword">chan</span>&lt;- error)</span></span> {
    <span class="hljs-keyword">defer</span> <span class="hljs-built_in">close</span>(recordCh)
    file, err := os.Open(<span class="hljs-string">"data/land_feed.csv"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-comment">// Send the error to the channel instead of log.Fatal, which would kill the whole process.</span>
        errCh &lt;- err
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-keyword">defer</span> file.Close()
    reader := csv.NewReader(file)
    _, _ = reader.Read() <span class="hljs-comment">// Read the headers</span>
    lineNumber := <span class="hljs-number">1</span>
    <span class="hljs-keyword">for</span> {
        record, err := reader.Read()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> err == io.EOF {
                <span class="hljs-keyword">return</span>
            }
            errCh &lt;- err
            <span class="hljs-keyword">return</span>
        }
        recordCh &lt;- &amp;Record{
            LineNo: lineNumber,
            Data:   record,
        }
        lineNumber++
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">transformAndProduceDbModel</span><span class="hljs-params">(recordCh &lt;-<span class="hljs-keyword">chan</span> *Record, transformCh <span class="hljs-keyword">chan</span>&lt;- *models.Land, errCh <span class="hljs-keyword">chan</span>&lt;- error, wg *sync.WaitGroup)</span></span> {
    <span class="hljs-keyword">defer</span> wg.Done()
    <span class="hljs-keyword">for</span> record := <span class="hljs-keyword">range</span> recordCh {
        readEntity, err := utils.CreateEntityFromRecord(record.Data, record.LineNo)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            errCh &lt;- err
            <span class="hljs-keyword">return</span>
        }
        dbEntity := models.FromReadModel(*readEntity)
        transformCh &lt;- dbEntity
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">writeAndConsumeDbModel</span><span class="hljs-params">(database *db.DB, transformCh &lt;-<span class="hljs-keyword">chan</span> *models.Land, errCh <span class="hljs-keyword">chan</span>&lt;- error, wg *sync.WaitGroup, mutex *sync.Mutex)</span></span> {
    <span class="hljs-keyword">defer</span> wg.Done()
    <span class="hljs-keyword">var</span> buffer []*models.Land
    batchSize := <span class="hljs-number">10000</span>
    <span class="hljs-keyword">for</span> dbEntity := <span class="hljs-keyword">range</span> transformCh {
        buffer = <span class="hljs-built_in">append</span>(buffer, dbEntity)
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(buffer) == batchSize {
            mutex.Lock()
            err := database.CreateLands(buffer)
            mutex.Unlock()
            <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
                errCh &lt;- err
                <span class="hljs-keyword">return</span>
            }
            buffer = buffer[:<span class="hljs-number">0</span>]
        }
    }
    <span class="hljs-comment">// Flush any remaining entities in the buffer</span>
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(buffer) &gt; <span class="hljs-number">0</span> {
        mutex.Lock()
        err := database.CreateLands(buffer)
        mutex.Unlock()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            errCh &lt;- err
            <span class="hljs-keyword">return</span>
        }
    }
}
</code></pre>
<p>The response received was as follows:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"error"</span>: <span class="hljs-string">""</span>,
    <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Successfully completed request for Multiprocessing For Reading, Transform And Writing."</span>,
    <span class="hljs-attr">"memoryUsage"</span>: <span class="hljs-number">4515240</span>,
    <span class="hljs-attr">"elapsed"</span>: <span class="hljs-number">23637820200</span>
}
</code></pre>
<h1 id="heading-conclusion">Conclusion</h1>
<p>The following table provides a comparative view of all four approaches:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Endpoint</strong></td><td><strong>Description</strong></td><td><strong>Memory Usage (bytes)</strong></td><td><strong>Elapsed Time (ns)</strong></td></tr>
</thead>
<tbody>
<tr>
<td><code>GET /solution-one</code></td><td>Load all data and insert in batches.</td><td>3,296,896,352</td><td>18,968,041,000</td></tr>
<tr>
<td><code>GET /solution-two</code></td><td>Read line by line and insert in batches of 10,000.</td><td>3,851,704</td><td>26,142,007,900</td></tr>
<tr>
<td><code>GET /solution-three</code></td><td>Use multiprocessing with a worker pool for reading and writing.</td><td>5,289,472</td><td>24,760,674,200</td></tr>
<tr>
<td><code>GET /solution-four</code></td><td>Pipeline approach on worker pool for reading, transforming, and writing.</td><td>4,515,240</td><td>23,637,820,200</td></tr>
</tbody>
</table>
</div><p>From the results, we can observe the following insights:</p>
<ul>
<li><p>Solution One is the fastest but consumes a disproportionate amount of memory as it loads everything into memory.</p>
</li>
<li><p>Solution Two uses the least memory but is the slowest.</p>
</li>
<li><p>Solution Three slightly improves time over Solution Two while drastically reducing memory usage compared to Solution One.</p>
</li>
<li><p>Solution Four further improves memory usage (second best overall) and execution time (also second best).</p>
</li>
</ul>
<p>In many cloud environments, longer execution times translate directly into higher compute costs and overshot budgets. Solution Four therefore stands out as the best choice for our scenario: it is not the outright winner on speed or memory individually, but it ranks second on both and strikes the best balance. This reflects the utility of multithreading: it rarely wins every metric outright, yet it often delivers the best overall trade-off across multiple criteria.</p>
<p>In this article, we examined a real-life problem and applied our knowledge of concurrent programming with goroutines to arrive at an optimal solution.</p>
]]></content:encoded></item><item><title><![CDATA[Architecture of a NextJS app on AWS: an interview story]]></title><description><![CDATA[Introduction
In a recent job interview for an Architect role, I was asked to create a deployment plan for a Next.js application. While the straightforward approach of deploying the app on Vercel directly from GitHub may seem like a viable option, it ...]]></description><link>https://oxyprogrammer.com/architecture-of-a-nextjs-app-on-aws-an-interview-story</link><guid isPermaLink="true">https://oxyprogrammer.com/architecture-of-a-nextjs-app-on-aws-an-interview-story</guid><category><![CDATA[AWS]]></category><category><![CDATA[AWS VPC]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[next js]]></category><category><![CDATA[architecture-decisions]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Fri, 14 Mar 2025 06:01:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735040709584/1e911bba-9adc-4789-ad3c-22692e6693cd.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>In a recent job interview for an Architect role, I was asked to create a deployment plan for a Next.js application. While the straightforward approach of deploying the app on Vercel directly from GitHub may seem like a viable option, it may not be the best choice for enterprise-level applications. Vercel can be costly for high-traffic applications, and there is less control over security and scalability aspects. Additionally, if the application has multiple components like a database, cache, and file system, the costs can quickly escalate.</p>
<p>In this article, I will share the thought process I used to address the interviewer's concerns and propose a more cost-effective and scalable deployment solution using AWS services. The aim is not only to showcase the various AWS services that can be leveraged, but also to provide an example of how real-world interviews can progress and the thought process that should guide the response.</p>
<p>This article follows an unconventional format, experimenting with an interview-style narrative. I, as the candidate, will present my proposed solutions, and the interviewer will provide feedback and additional constraints to challenge the solution. Through this interactive format, I hope to demonstrate the thought process and adaptability required in a technical interview setting.</p>
<p>The target audience for this article includes DevOps professionals, full-stack developers, and AWS enthusiasts who are interested in exploring deployment strategies for Next.js applications.</p>
<h1 id="heading-ice-breaker">Ice Breaker</h1>
<p><strong>Interviewer</strong>: What are your preferred technologies for full-stack application development?</p>
<p><strong>Me</strong>: It depends on the application size and scope. For enterprise-level applications, I prefer C# and Angular/React. For distributed applications with smaller services, I tend to use Golang. And for small to medium-sized projects that require less compute-intensive operations, I often recommend Next.js.</p>
<p><strong>Interviewer</strong>: That's interesting. Let's assume we have a medium-sized application, and since you prefer Next.js, how would you suggest deploying the application?</p>
<p><strong>Me</strong>: Well, Vercel and similar cloud providers offer great integration with GitHub, providing out-of-the-box environment support and custom domain management. This can be a straightforward solution for a basic Next.js application.</p>
<p><strong>Interviewer</strong>: Our application, however, makes use of a database, file system, and a cache. Do you think Vercel is the best option from a cost perspective?</p>
<p><strong>Me</strong>: You're right. Considering the complexity of the application with the additional components, using Vercel may not be the most cost-effective solution. In this case, I would suggest leveraging AWS services directly.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">With his second question, it became clear that the interviewer was more focused on the DevOps plan than on the technology side or principles of distributed systems. I aimed to provide the simplest possible answer, starting with a brute force approach, even though I recognized it might not be the best solution.</div>
</div>

<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">As he continued with subsequent questions, he introduced additional components to the application, making it apparent that choosing Vercel was no longer a viable option.</div>
</div>

<h1 id="heading-first-solution">First Solution</h1>
<p><strong>Me</strong>: May I assume the database to be an RDS?</p>
<p><strong>Interviewer</strong>: Be my guest!</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcDy_mwOnQCDkQdurkisso8kzKbXBu4gAIT0B5iBpqMpKeVacO0Es-Cj3fBuWTE-nJVAAMucLz6xEfbfZu0c5nrT3TIBsYKK2h-6cIOfdI2jKz4ehzFlPgsR9TQLhiBqKwzb-6U1Q?key=tCg7g6G0mFGpEvR7eDnSui3C" alt /></p>
<p><strong>Me</strong>: I propose the following as the initial solution:</p>
<ul>
<li><p>An EC2 instance for hosting the Next.js application.</p>
</li>
<li><p>Aurora DB, SQS, and EFS (Elastic File System) as managed AWS services, which provide high availability out of the box.</p>
</li>
<li><p>Route 53 for DNS management.</p>
</li>
<li><p>A GitHub workflow to push the latest build to the EC2 instance.</p>
</li>
</ul>
<p><strong>Interviewer</strong>: This solution has quite a few problems. Firstly, while the managed services are highly available, the EC2 instance isn’t. My traffic follows a pattern where there are only a few hours in the day when the application experiences very high traffic—almost 10 times the usual volume. How should I proceed with a cost-effective solution?</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The interviewer clearly wanted me to provide a cloud provider solution, as he kept introducing conditions to advance the discussion. In his feedback he acknowledged the high availability of the managed services but pointed out that a single EC2 instance lacks that same availability, making it evident that the application itself was not highly available.</div>
</div>

<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Additionally, he noted the traffic surge that occurs for only a few hours a day, which likely suggests he had a trading application in mind that operates solely during market hours. Simply adding more EC2 instances to improve availability isn’t a viable option; provisioning EC2 capacity to handle 10 times the usual traffic during peak hours would not be cost-effective.</div>
</div>

<h1 id="heading-challenge-1-scalability-vis-a-vis-cost">Challenge 1: Scalability vis-a-vis cost</h1>
<p><strong>Me</strong>: Since we want to enhance the scalability of our application in an on-demand manner, I propose using an ECS Fargate cluster, as illustrated in the diagram below:</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXftFeC5xyizNW_8ErQiqwfsRrd9ixJCI82LDriY3YCgqOuBhuvpSJp7BKSh2B_3WmbDSqsdUNHn9Xu4dpsi2dTJ4XEEQgDl2hMMJeOE1PXpoyA3bVavjryrrFtH4SKhm3r5itsZlg?key=tCg7g6G0mFGpEvR7eDnSui3C" alt /></p>
<p><strong>Explanation</strong>:</p>
<ul>
<li><p>The Aurora DB, EFS, and SQS will continue to be managed AWS services, ensuring their availability.</p>
</li>
<li><p>ECS (Elastic Container Service) allows you to run containers on demand. During periods of high traffic, ECS services will automatically spin up tasks (containers), and as the load decreases, the extra containers will be terminated. Billing is based on usage, and the underlying hardware is fully managed by AWS, alleviating concerns about infrastructure management.</p>
</li>
<li><p>ECS can also run on EC2 instances; however, that requires provisioning dedicated EC2 capacity for the cluster. Given the disproportionate traffic spikes, Fargate is the chosen solution for this scenario.</p>
</li>
<li><p>ECS can connect directly to ECR (Elastic Container Registry), where the built image can be stored and accessed through the GitHub workflow.</p>
</li>
</ul>
<p><strong>Interviewer</strong>: While the scalability aspect is addressed, I still don’t see how the application is highly available. Additionally, do you believe the security aspects are adequately handled?</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><em>The interviewer was correct in noting that availability has not yet been addressed; the ECS cluster would still operate within the same Availability Zone (AZ), posing a risk of downtime. I overlooked this important aspect.</em></div>
</div>

<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><em>He also hinted at the necessity of considering the security aspects of the deployed application in the public cloud.</em></div>
</div>

<h1 id="heading-challenge-2-security-and-availability">Challenge 2: Security and Availability</h1>
<p><strong>Me</strong>: My apologies! I overlooked the availability aspect, and I understand that security must also be addressed. I propose the following diagram:</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeNEsvepwwpjE1YYmkdhPgXLTm0-Ty8fo7nyDSkbhFjOkNW1qlvvv1dbqxlI4Sk74mSc7CDYDliGi8BK-llvZszGc5fbUqCv_JnymJDWwJ9glzMcnoTvnYNW4Yv4H7fwXRayVZRJw?key=tCg7g6G0mFGpEvR7eDnSui3C" alt /></p>
<p><strong>Explanation</strong>:</p>
<ol>
<li><p>A VPC belongs to a region, and each subnet (public or private) resides within a single Availability Zone (AZ). By deploying two ECS clusters in private subnets that sit in different AZs, we can enhance the application's availability.</p>
</li>
<li><p>An Application Load Balancer (ALB) is a highly available managed AWS service that can operate across multiple public subnets (and thus multiple AZs). The load balancer will distribute traffic in a round-robin manner across the two clusters.</p>
</li>
<li><p>Since the ECS clusters are now in private subnets, their access to the ECR repository and managed services like Aurora DB, SQS, and EFS is restricted. Therefore, the VPC must include the following endpoints:</p>
<ol>
<li><p>Aurora DB itself runs inside the VPC (in a DB subnet group), so the clusters can reach it directly without an endpoint; Gateway Endpoints apply only to S3 and DynamoDB.</p>
</li>
<li><p>An Interface Endpoint for accessing the managed SQS service over a private link.</p>
</li>
<li><p>An EFS mount target for connecting to the EFS.</p>
</li>
<li><p>An ECR Endpoint for accessing the private ECR repository.</p>
</li>
</ol>
</li>
<li><p>The VPC will be connected to an Internet Gateway (IG) to allow external traffic to reach the ALB in the public subnets. As the first point of contact, the ALB will handle SSL offloading, and Route 53 will resolve the domain to the ALB's DNS name via an alias record.</p>
</li>
</ol>
<p><strong>Interviewer</strong>: I see that most of the issues are resolved. One more thing: our app must let users download a report that, for confidentiality reasons, can only be generated on our on-premises servers. How would you handle that?</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The interviewer is introducing an additional condition related to connecting with the on-premises data center.</div>
</div>

<h1 id="heading-challenge-3-integration-with-on-prem-data-center">Challenge 3: Integration with On-Prem Data Center</h1>
<p><strong>Me</strong>: To accommodate the added condition of connectivity with the on-premises data center, we can utilize AWS Direct Connect. The following diagram elaborates on this idea:</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdwZkrUGDZ0nbOeKgmI42ngfoKxEvTtv53CNjuZOKcJ7dffY4_M1q8uFMxhDmuM-QGnLJ74pI-SsE87d1k50bvyyzPzvktxYwaOfBSqugdHEA9rvd2ZN7ZUsRaDdEimMx-_6NWo4A?key=tCg7g6G0mFGpEvR7eDnSui3C" alt /></p>
<p><strong>Explanation</strong>: The VPC connects to the on-premises data center using AWS Direct Connect.</p>
<h1 id="heading-further">Further</h1>
<p>The article introduces numerous AWS services and concepts; however, the architecture has been simplified for brevity, and many minor details have been omitted. For example:</p>
<ol>
<li><p>SSL on the ALB may be managed automatically by AWS Certificate Manager (ACM), which handles renewals upon expiry. Additionally, since an ALB exposes a DNS name rather than an Elastic IP address, Route 53 should point to it using an alias record.</p>
</li>
<li><p>Many flows have been simplified for clarity, and the diagrams offer a high-level view of the overall architecture. For instance, the connection from GitHub to ECR would involve additional intricacies. If an Infrastructure as Code (IaC) tool like Terraform is required, more components and services would be necessary in the DevOps flow.</p>
</li>
<li><p>Moreover, the subnets within the VPC will need their own route tables and security groups, which have been omitted in the diagrams for the sake of brevity.</p>
</li>
<li><p>There is no single way of doing things; the same problem could have been solved using EKS (Elastic Kubernetes Service). However, using Kubernetes for a single Next.js application seems like overkill, and ECS fits the bill perfectly.</p>
</li>
</ol>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we explored how enterprise-level deployments are structured for a Next.js application. We discussed various AWS services, including ECS (with its services and tasks), VPCs, subnets, and the different endpoint types, such as gateway and interface endpoints. Additionally, we examined AWS Direct Connect, which connects an AWS VPC to an on-premises data center.</p>
<p>Throughout the interview process, we saw that answering incrementally and letting the interviewer introduce constraints keeps the discussion productive. A skilled interviewer is typically adept at prompting the interviewee to elicit the answers they are seeking.</p>
<p>I hope this article has offered you valuable insights into the enterprise deployment strategies employed by large organizations. Even a small application built with Next.js requires robust infrastructure to operate in a scalable, available, and secure manner.</p>
]]></content:encoded></item><item><title><![CDATA[Tenets of Multithreading in GO: Detailed Tutorial]]></title><description><![CDATA[Introduction
Multiprocessing or multithreading is a critical aspect of many compiled languages, and go (often referred to as Golang) is no exception. Go began development around 2007-08, a time when chip manufacturers recognized the benefits of using...]]></description><link>https://oxyprogrammer.com/tenets-of-multithreading-in-go-detailed-tutorial</link><guid isPermaLink="true">https://oxyprogrammer.com/tenets-of-multithreading-in-go-detailed-tutorial</guid><category><![CDATA[multithreading in golang]]></category><category><![CDATA[threading patterns in golang]]></category><category><![CDATA[golang]]></category><category><![CDATA[multithreading]]></category><category><![CDATA[parallel processing]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Fri, 14 Mar 2025 05:57:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735044299208/bde427b8-f578-4265-ad47-9ff84d56f8ac.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Multiprocessing and multithreading are critical aspects of many compiled languages, and Go (often referred to as Golang) is no exception. Go's development began around 2007-08, a time when chip manufacturers recognized the benefits of adding processor cores to increase computational power, rather than merely boosting the clock speed of a single processor. While multithreading existed before this period, its real advantages became apparent with the advent of multiprocessor computers. In this article, we will review the various tools Go provides for writing robust concurrent code, along with simple examples to help us grasp the concepts.</p>
<h1 id="heading-background">Background</h1>
<p>Multiprocessing and multitasking are two distinct concepts, and most languages that support multithreading conceal the details within their concurrency frameworks. As a programmer, you might not even know whether a new OS thread is being created under the hood. Additionally, system threads differ from the managed (green) threads created by a language runtime.</p>
<p>You may be a seasoned developer who is well-versed in the concepts of multitasking and multiprocessing. However, I strongly recommend reading through <a target="_blank" href="https://oxyprogrammer.com/the-basics-of-async-and-parallel-programming">this article</a> to clarify any grey areas you may have. This will also ensure that we (you and I) are on the same page as we proceed. Another request is to have access to the <a target="_blank" href="https://go.dev/play/">Go playground</a> if you wish to try out the code provided in the article. I have endeavored to keep the code extremely simple so it can be tried on the fly without needing to set up a Go development environment.</p>
<p>I am also assuming that you are here specifically for multithreading in <code>go</code> and are otherwise comfortable with <code>go</code> syntax.</p>
<h1 id="heading-goroutines">Goroutines</h1>
<p>A goroutine is a green thread built into the Go language. Green threads (also known as managed threads or user threads) are lightweight enough that thousands of them can run simultaneously in a single Go program, with the complexities of scheduling and memory management handled seamlessly by the Go runtime.</p>
<p>Let’s start with a simple program that does not use a Goroutine.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span>{
  start := time.Now()
  doFakeApiCall()
  fmt.Println(<span class="hljs-string">"\nFinished"</span>)
  elapsed := time.Since(start)
  fmt.Printf(<span class="hljs-string">"\nProcesses took %s"</span>, elapsed)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doFakeApiCall</span><span class="hljs-params">()</span></span>{
  time.Sleep(time.Second * <span class="hljs-number">2</span>)
  fmt.Println(<span class="hljs-string">"\nAPI call took two seconds!"</span>)
}
</code></pre>
<p>If you run this code in the Go playground, you will see the following output, which is clear and easy to understand:</p>
<pre><code class="lang-plaintext">API call took two seconds!
Finished
Processes took 2s
Program exited.
</code></pre>
<p>However, if we place the <code>go</code> keyword in front of the <code>doFakeApiCall()</code> call, the program will exit before <code>doFakeApiCall()</code> completes. As a result, you will get:</p>
<pre><code class="lang-plaintext">Finished
Processes took 0s
Program exited.
</code></pre>
<p>This occurs because the <code>doFakeApiCall()</code> function call is now happening asynchronously. The main function does not wait for it to return and proceeds to the next instructions, causing the main function to exit before <code>doFakeApiCall()</code> has completed. Since we cannot predict the duration of asynchronous processing, we need to implement a mechanism in the main function that ensures it waits until the Goroutine returns. We will explore this in the next section.</p>
<h1 id="heading-wait-group">Wait Group</h1>
<p>We concluded the last section with a problem. Below is a Go program similar to the previous one, but now it includes an additional fake database call function.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
   <span class="hljs-string">"fmt"</span>
   <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span>{
   start := time.Now()
   <span class="hljs-keyword">go</span> doFakeApiCall()
   <span class="hljs-keyword">go</span> doFakeDbCall()
   fmt.Println(<span class="hljs-string">"\nFinished"</span>)
   elapsed := time.Since(start)
   fmt.Printf(<span class="hljs-string">"\nProcesses took %s"</span>, elapsed)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doFakeApiCall</span><span class="hljs-params">()</span></span>{
   time.Sleep(time.Second * <span class="hljs-number">2</span>)
   fmt.Println(<span class="hljs-string">"\nAPI call took two seconds!"</span>)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doFakeDbCall</span><span class="hljs-params">()</span></span>{
   time.Sleep(time.Second * <span class="hljs-number">1</span>)
   fmt.Println(<span class="hljs-string">"\nDB call took one second!"</span>)
}
</code></pre>
<p>As you may have guessed, the program will exit before the <code>doFakeApiCall()</code> and <code>doFakeDbCall()</code> functions complete. Here, <code>WaitGroup</code> from the sync package comes to the rescue. <code>WaitGroup</code> provides several methods that help synchronize the completion of concurrent tasks:</p>
<ul>
<li><p><code>Add(&lt;int&gt;)</code>: Sets the number of concurrent tasks to handle.</p>
</li>
<li><p><code>Done()</code>: Decrements the count of concurrent tasks.</p>
</li>
<li><p><code>Wait()</code>: Blocks until the counter reaches zero, allowing the application flow to proceed.</p>
</li>
</ul>
<p>Let’s amend our previous code to make use of <code>WaitGroup</code>.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main


<span class="hljs-keyword">import</span> (
   <span class="hljs-string">"fmt"</span>
   <span class="hljs-string">"sync"</span>
   <span class="hljs-string">"time"</span>
)

<span class="hljs-keyword">var</span> wg = sync.WaitGroup{}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
   wg.Add(<span class="hljs-number">2</span>) <span class="hljs-comment">//Add two tasks to the waitgroup</span>
   start := time.Now()
   <span class="hljs-keyword">go</span> doFakeApiCall()
   <span class="hljs-keyword">go</span> doFakeDbCall()
   wg.Wait() <span class="hljs-comment">//Wait for all tasks to finish.</span>
   fmt.Println(<span class="hljs-string">"\nFinished"</span>)
   elapsed := time.Since(start)
   fmt.Printf(<span class="hljs-string">"\nProcesses took %s"</span>, elapsed)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doFakeApiCall</span><span class="hljs-params">()</span></span> {
   time.Sleep(time.Second * <span class="hljs-number">2</span>)
   fmt.Println(<span class="hljs-string">"\nAPI call took two seconds!"</span>)
   wg.Done() <span class="hljs-comment">//reduce waitgroup counter by 1</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doFakeDbCall</span><span class="hljs-params">()</span></span> {
   time.Sleep(time.Second * <span class="hljs-number">1</span>)
   fmt.Println(<span class="hljs-string">"\nDB call took one second!"</span>)
   wg.Done() <span class="hljs-comment">//reduce waitgroup counter by 1</span>
}
</code></pre>
<p>The output for this program will be as follows:</p>
<pre><code class="lang-plaintext">DB call took one second!
API call took two seconds!
Finished
Processes took 2s
Program exited.
</code></pre>
<p>You may have noticed that the DB call finished first, despite being called after the API call. The total time taken for both the <code>doFakeApiCall()</code> and <code>doFakeDbCall()</code> calls is 2 seconds. If these were called synchronously, they would have taken 3 seconds.</p>
<h1 id="heading-channels">Channels</h1>
<p>While Wait Groups are effective for running independent goroutines, they do not facilitate communication between them. In Go, channels are the constructs used to communicate data between goroutines. Below are some syntax details for channels:</p>
<ul>
<li><p><code>ch := make(chan &lt;Type&gt;)</code>: Creates a channel.</p>
</li>
<li><p><code>ch &lt;- someData</code>: Publishes data to the channel. For an unbuffered channel this call blocks, meaning control won't proceed until the data is read from the channel.</p>
</li>
<li><p><code>someVar := &lt;-ch</code>: Reads from the channel into a variable. This is also a blocking call, waiting for data to become available.</p>
</li>
<li><p><code>close(ch)</code>: Closes a channel.</p>
</li>
</ul>
<p>Building on the example from the last section, let's consider an additional goroutine, <code>doCreateFakeReport</code>, which creates a report from the outputs of <code>doFakeDbCall</code> and <code>doFakeApiCall</code>.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main


<span class="hljs-keyword">import</span> (
   <span class="hljs-string">"fmt"</span>
   <span class="hljs-string">"time"</span>
   <span class="hljs-string">"sync"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
   start := time.Now()
   apiCh:=<span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)
   dbCh:=<span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)
   <span class="hljs-keyword">var</span> wg sync.WaitGroup
   wg.Add(<span class="hljs-number">1</span>)

   <span class="hljs-keyword">go</span> doFakeApiCall(apiCh)
   <span class="hljs-keyword">go</span> doFakeDbCall(dbCh)
   <span class="hljs-keyword">go</span> doCreateFakeReport(apiCh,dbCh,&amp;wg)

   wg.Wait()
   fmt.Println(<span class="hljs-string">"Finished"</span>)
   elapsed := time.Since(start)
   fmt.Printf(<span class="hljs-string">"Processes took %s"</span>, elapsed)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doFakeApiCall</span><span class="hljs-params">(apiChan <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)</span></span> {
   time.Sleep(time.Second * <span class="hljs-number">2</span>)
   fmt.Println(<span class="hljs-string">"API call took two seconds!"</span>)
   apiChan&lt;-<span class="hljs-string">"API 123"</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doFakeDbCall</span><span class="hljs-params">(dbChan <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)</span></span> {
   time.Sleep(time.Second * <span class="hljs-number">1</span>)
   fmt.Println(<span class="hljs-string">"DB call took one second!"</span>)
   dbChan&lt;-<span class="hljs-string">"DB 123"</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doCreateFakeReport</span><span class="hljs-params">(apiChan <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>,dbChan <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, wg *sync.WaitGroup)</span></span>{
   fmt.Println(<span class="hljs-string">"Final Report: Api result is"</span>, &lt;-apiChan , <span class="hljs-string">"Db Result is:"</span>, &lt;-dbChan)
   wg.Done()
}
</code></pre>
<p>In this code, we are:</p>
<ul>
<li><p>Creating channels for <code>doFakeDbCall</code> and <code>doFakeApiCall</code> to publish their results.</p>
</li>
<li><p>Passing these channels to the <code>doCreateFakeReport</code> function to read from them.</p>
</li>
<li><p>Using a Wait Group to wait until <code>doCreateFakeReport</code> finishes.</p>
</li>
</ul>
<p>The output of this code is as follows:</p>
<pre><code class="lang-plaintext">DB call took one second!
API call took two seconds!
Final Report: Api result is API 123 Db Result is: DB 123
Finished
Processes took 2s
Program exited.
</code></pre>
<h2 id="heading-buffered-channels">Buffered Channels</h2>
<p>Buffered channels allow you to publish a predetermined number of items to the channel before a send blocks waiting for a reader. Here is an example of a buffered channel:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main


<span class="hljs-keyword">import</span> (
   <span class="hljs-string">"fmt"</span>
   <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span>{

   ch:=<span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">int</span>, <span class="hljs-number">3</span>)

   <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">3</span>; i++ {
           <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(n <span class="hljs-keyword">int</span>)</span></span> { <span class="hljs-comment">//pass a copy of i (required before Go 1.22)</span>
               time.Sleep(time.Second * <span class="hljs-number">1</span>)
               ch &lt;- n
           }(i)
       }

   fmt.Println(<span class="hljs-string">"Channel published: "</span>, &lt;-ch)
   fmt.Println(<span class="hljs-string">"Channel published: "</span>, &lt;-ch)
   fmt.Println(<span class="hljs-string">"Channel published: "</span>, &lt;-ch)

   fmt.Printf(<span class="hljs-string">"Exiting"</span>)
}
</code></pre>
<p>Output (the order of the values may vary between runs):</p>
<pre><code class="lang-plaintext">Channel published:  0
Channel published:  1
Channel published:  2
Exiting
Program exited.
</code></pre>
<p>Using buffered channels is preferred when you know the number of goroutines that will be launched.</p>
<h2 id="heading-select-pattern-for-channels">Select Pattern for Channels</h2>
<p>A common pattern for reading from multiple channels is using a select statement within an infinite loop. Here's an example:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main


<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    ch1 := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)
    ch4 := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)
    endCh := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">int</span>)


    <span class="hljs-keyword">go</span> doPublish1(ch1)
    <span class="hljs-keyword">go</span> doPublish4(ch4)
    <span class="hljs-keyword">go</span> doEndOn10(endCh)


    <span class="hljs-keyword">for</span> {
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> msg := &lt;-ch1:
            fmt.Println(msg)
        <span class="hljs-keyword">case</span> msg := &lt;-ch4:
            fmt.Println(msg)
        <span class="hljs-keyword">case</span> msg := &lt;-endCh:
            fmt.Println(msg)
            os.Exit(<span class="hljs-number">0</span>)
        }
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doPublish1</span><span class="hljs-params">(ch <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)</span></span> {
    <span class="hljs-keyword">for</span> {
        time.Sleep(time.Second * <span class="hljs-number">1</span>)
        ch &lt;- <span class="hljs-string">"Publishing every 1 second"</span>
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doPublish4</span><span class="hljs-params">(ch <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)</span></span> {
    <span class="hljs-keyword">for</span> {
        time.Sleep(time.Second * <span class="hljs-number">4</span>)
        ch &lt;- <span class="hljs-string">"Publishing every 4 seconds"</span>
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">doEndOn10</span><span class="hljs-params">(ch <span class="hljs-keyword">chan</span> <span class="hljs-keyword">int</span>)</span></span> {
    <span class="hljs-keyword">for</span> {
        time.Sleep(time.Second * <span class="hljs-number">10</span>)
        fmt.Println(<span class="hljs-string">"Sending end signal"</span>)
        ch &lt;- <span class="hljs-number">0</span>
    }
}
</code></pre>
<ol>
<li><p>The first goroutine, <code>doPublish1</code>, publishes to its channel every 1 second.</p>
</li>
<li><p>The second goroutine, <code>doPublish4</code>, publishes to its channel every 4 seconds.</p>
</li>
<li><p>The last goroutine waits for 10 seconds before publishing to its channel, and once it does, the program exits.</p>
</li>
</ol>
<p>The output in the Go playground is as follows:</p>
<pre><code class="lang-plaintext">Publishing every 1 second
Publishing every 1 second
Publishing every 1 second
Publishing every 1 second
Publishing every 4 seconds
Publishing every 1 second
Publishing every 1 second
Publishing every 1 second
Publishing every 1 second
Publishing every 4 seconds
Publishing every 1 second
Publishing every 1 second
Sending end signal
0
</code></pre>
<h1 id="heading-synchronization">Synchronization</h1>
<p>There may be situations where shared code is accessed by multiple goroutines, which can lead to race conditions. Golang offers several constructs to help with the synchronization of shared code.</p>
<p>For example, the following code demonstrates a clear case of a race condition:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main


<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"sync"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span>{
   counter:=<span class="hljs-number">0</span>
   <span class="hljs-keyword">var</span> wg sync.WaitGroup
   wg.Add(<span class="hljs-number">2</span>)

   <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span>{ <span class="hljs-comment">//Increments counter by 30000</span>
       <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">30000</span>; i++{
               counter++
       }
       wg.Done()
   }()

   <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span>{ <span class="hljs-comment">//Decrements counter by 30000</span>
       <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">30000</span>; i++{
               counter--
       }
       wg.Done()
   }()

   wg.Wait()
   fmt.Println(<span class="hljs-string">"Final counter value: "</span>,counter)
}
</code></pre>
<p>This code simply increments and decrements a counter using two separate goroutines. The program waits for the goroutines to finish executing and then prints the counter's value. Because the counter is a shared variable between the goroutines, this creates a race condition. Every time you run this code in the Go playground, you may see a different final value for the counter, including 0. Go's built-in race detector (<code>go run -race</code>) will also flag this code.</p>
<p>In the following sections, we will explore the constructs provided by Go to resolve race conditions arising from shared resources.</p>
<h2 id="heading-mutexes">Mutexes</h2>
<p>A mutex, short for mutual exclusion, helps with code synchronization by ensuring that only one goroutine at a time can execute the portion of code that touches the shared state. The previous example can be corrected as shown below:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main


<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"sync"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span>{
   counter:=<span class="hljs-number">0</span>
   <span class="hljs-keyword">var</span> wg sync.WaitGroup
   <span class="hljs-keyword">var</span> mut sync.Mutex
   wg.Add(<span class="hljs-number">2</span>)

   <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span>{ <span class="hljs-comment">//Increments counter by 30000</span>
       <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">30000</span>; i++{
               mut.Lock()
               counter++
               mut.Unlock()
       }
       wg.Done()
   }()

   <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span>{ <span class="hljs-comment">//Decrements counter by 30000</span>
       <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">30000</span>; i++{
               mut.Lock()
               counter--
               mut.Unlock()
       }
       wg.Done()
   }()

   wg.Wait()
   fmt.Println(<span class="hljs-string">"Final counter value: "</span>,counter)
}
</code></pre>
<p>Running the above code will consistently yield 0. Note that the code between the lock and unlock statements of the mutex is referred to as the critical section.</p>
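<p>As a side note on style, Go code commonly pairs <code>Lock</code> with a deferred <code>Unlock</code> when the critical section spans a whole function, so the mutex is released even on an early return or a panic. Below is a minimal sketch of that idiom; the <code>SafeCounter</code> type is illustrative, not part of the original example:</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "sync"
)

// SafeCounter bundles a counter with the mutex that guards it,
// so callers cannot touch the value without holding the lock.
type SafeCounter struct {
    mu sync.Mutex
    n  int
}

// Add locks, defers the unlock, and mutates the counter.
// The deferred Unlock runs even if the body panics.
func (c *SafeCounter) Add(delta int) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.n += delta
}

// Value reads the counter under the same lock.
func (c *SafeCounter) Value() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.n
}

func main() {
    var c SafeCounter
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i &lt; 30000; i++ {
            c.Add(1)
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i &lt; 30000; i++ {
            c.Add(-1)
        }
    }()
    wg.Wait()
    fmt.Println("Final counter value:", c.Value()) // always 0
}
</code></pre>
<p>Inside a tight loop like the earlier example, calling <code>Lock</code> and <code>Unlock</code> directly is still the right choice, since a <code>defer</code> would not run until the enclosing function returns.</p>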
<h2 id="heading-atomic-variables">Atomic Variables</h2>
<p>While mutexes help achieve synchronization, they introduce some boilerplate code. Atomic operations can enhance brevity. The previous example can be rewritten using atomic operations as follows:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"sync"</span>
  <span class="hljs-string">"sync/atomic"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span>{
   <span class="hljs-keyword">var</span> counter <span class="hljs-keyword">int32</span> = <span class="hljs-number">0</span>
   <span class="hljs-keyword">var</span> wg sync.WaitGroup
   wg.Add(<span class="hljs-number">2</span>)

   <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span>{ <span class="hljs-comment">//Increments counter by 30000</span>
       <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">30000</span>; i++{
              atomic.AddInt32(&amp;counter,<span class="hljs-number">1</span>)
       }
       wg.Done()
   }()

   <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span>{ <span class="hljs-comment">//Decrements counter by 30000</span>
       <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">30000</span>; i++{
              atomic.AddInt32(&amp;counter,<span class="hljs-number">-1</span>)
       }
       wg.Done()
   }()

   wg.Wait()
   fmt.Println(<span class="hljs-string">"Final counter value: "</span>,counter)
}
</code></pre>
<p>In this example, the following code:</p>
<pre><code class="lang-go">mut.Lock()
counter--
mut.Unlock()
</code></pre>
<p>has been replaced with:</p>
<pre><code class="lang-go">atomic.AddInt32(&amp;counter,<span class="hljs-number">-1</span>)
</code></pre>
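<p>One caveat: reads of an atomically updated variable that happen concurrently with writers should also go through the <code>atomic</code> package, because a plain read of <code>counter</code> while another goroutine is calling <code>AddInt32</code> is itself a data race. (The final read in the example above is safe, since it happens after <code>wg.Wait()</code>.) A minimal sketch, not from the original example, using <code>atomic.LoadInt32</code>:</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var counter int32
    var wg sync.WaitGroup
    wg.Add(2)

    go func() { // writer
        defer wg.Done()
        for i := 0; i &lt; 30000; i++ {
            atomic.AddInt32(&amp;counter, 1)
        }
    }()

    go func() { // concurrent reader: must use LoadInt32, not a plain read
        defer wg.Done()
        for atomic.LoadInt32(&amp;counter) &lt; 30000 {
            // spin until the writer finishes (illustration only;
            // real code would block on a WaitGroup or a channel)
        }
    }()

    wg.Wait()
    fmt.Println("Final counter value:", atomic.LoadInt32(&amp;counter)) // 30000
}
</code></pre>
<p>Since Go 1.19, the typed <code>atomic.Int32</code> wrapper provides <code>Add</code> and <code>Load</code> methods, which makes it harder to accidentally mix atomic and non-atomic access to the same variable.</p>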
<h2 id="heading-condition-variables">Condition Variables</h2>
<p>In the previous code examples, synchronization guarantees the correct final result, but sometimes we also need to enforce an invariant during the intermediate stages. Specifically, in the example from the last section, we want to ensure that the counter never goes below zero at any point during program execution.</p>
<p>Below is the amended code that modifies the mutex example to ensure that the value of the counter never falls below 0:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"sync"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span>{
   counter:=<span class="hljs-number">0</span>
   <span class="hljs-keyword">var</span> wg sync.WaitGroup
   <span class="hljs-keyword">var</span> mut sync.Mutex
   <span class="hljs-keyword">var</span> counterChecker = sync.NewCond(&amp;mut)
   wg.Add(<span class="hljs-number">2</span>)

   <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span>{ <span class="hljs-comment">//Increments counter by 30000</span>
       <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">30000</span>; i++{
           mut.Lock()
           counter++
           fmt.Println(counter)
           counterChecker.Signal()
           mut.Unlock()
       }
       wg.Done()
   }()

   <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span>{ <span class="hljs-comment">//Decrements counter by 30000</span>
       <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">30000</span>; i++{
           mut.Lock()
            <span class="hljs-keyword">for</span> counter<span class="hljs-number">-1</span> &lt; <span class="hljs-number">0</span>{
               counterChecker.Wait()
           }
           counter--
           fmt.Println(counter)
           mut.Unlock()
       }
       wg.Done()
   }()

   wg.Wait()
   fmt.Println(<span class="hljs-string">"Final counter value: "</span>,counter)
}
</code></pre>
<p>In this code, the decrementing goroutine waits on the condition variable whenever a decrement would take the counter negative. Conversely, the incrementing goroutine sends a signal each time it increments the counter.</p>
<p>This mechanism ensures that the counter never falls below zero at any point during the program's execution.</p>
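<p>Note that the <code>sync.Cond</code> documentation recommends calling <code>Wait</code> inside a loop that re-checks the condition: by the time a woken goroutine reacquires the lock, another goroutine may already have invalidated the condition. Here is a minimal sketch of that idiom using an illustrative FIFO queue (the <code>queue</code> type is not part of the original example):</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "sync"
)

// queue is an illustrative FIFO guarded by a condition variable.
type queue struct {
    mu    sync.Mutex
    cond  *sync.Cond
    items []int
}

func newQueue() *queue {
    q := &amp;queue{}
    q.cond = sync.NewCond(&amp;q.mu)
    return q
}

func (q *queue) push(v int) {
    q.mu.Lock()
    q.items = append(q.items, v)
    q.cond.Signal() // wake one waiting consumer
    q.mu.Unlock()
}

func (q *queue) pop() int {
    q.mu.Lock()
    defer q.mu.Unlock()
    // Re-check the condition after every wakeup: another consumer
    // may have emptied the queue before this goroutine got the lock.
    for len(q.items) == 0 {
        q.cond.Wait()
    }
    v := q.items[0]
    q.items = q.items[1:]
    return v
}

func main() {
    q := newQueue()
    go q.push(42)
    fmt.Println(q.pop()) // blocks until the item arrives, then prints 42
}
</code></pre>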
<h1 id="heading-worker-pool">Worker Pool</h1>
<p>The worker pool is a fundamental Go concurrency pattern that facilitates the creation of efficient pipelines. In Go, a pipeline is constructed using channels: one set of goroutines feeds data into a channel, while another set consumes it, processes it, and passes the results on.</p>
<p>The worker pool pattern represents the most basic form of a pipeline. It helps manage computational resources on a machine while leveraging the benefits of multiprocessing. This pattern is particularly useful for controlling the rate of resource consumption and maintaining system stability under heavy loads.</p>
<p>Implementation:</p>
<ol>
<li><p>A fixed number of worker goroutines are initialized.</p>
</li>
<li><p>These workers continuously dequeue items from a feed channel.</p>
</li>
<li><p>After processing, workers send results to a results channel.</p>
</li>
<li><p>The results channel can serve as a feed channel for subsequent stages, creating a multi-stage pipeline.</p>
</li>
</ol>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main


<span class="hljs-keyword">import</span> (
   <span class="hljs-string">"fmt"</span>
   <span class="hljs-string">"time"</span>
   <span class="hljs-string">"strconv"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
   numWorkers := <span class="hljs-number">3</span>
   numJobs    := <span class="hljs-number">10</span>

   jobs := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">int</span>, numJobs)
   results:=<span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, numJobs)

   <span class="hljs-comment">// Start worker goroutines</span>
   <span class="hljs-keyword">for</span> w := <span class="hljs-number">1</span>; w &lt;= numWorkers; w++ {
      <span class="hljs-keyword">go</span> worker(w, jobs, results)
   }

   <span class="hljs-comment">// Send jobs to the jobs channel</span>
   <span class="hljs-keyword">for</span> j := <span class="hljs-number">1</span>; j &lt;= numJobs; j++ {
      jobs &lt;- j
      fmt.Println(<span class="hljs-string">"Produced job"</span>, j)
   }

   <span class="hljs-built_in">close</span>(jobs) <span class="hljs-comment">// Close the channel to indicate no more jobs will be sent</span>

   <span class="hljs-keyword">for</span> r := <span class="hljs-number">1</span>; r&lt;= numJobs; r++{
       fmt.Println(&lt;-results)
   }
   fmt.Println(<span class="hljs-string">"All jobs have been processed."</span>)
}

<span class="hljs-comment">// Worker function processes jobs from the jobs channel</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">worker</span><span class="hljs-params">(id <span class="hljs-keyword">int</span>, jobs &lt;-<span class="hljs-keyword">chan</span> <span class="hljs-keyword">int</span>, results <span class="hljs-keyword">chan</span> &lt;- <span class="hljs-keyword">string</span>)</span></span> {
   fmt.Println(<span class="hljs-string">"Waiting in worker"</span>)
   <span class="hljs-keyword">for</span> job := <span class="hljs-keyword">range</span> jobs {
      <span class="hljs-comment">// Simulate doing some work</span>
      fmt.Println(<span class="hljs-string">"Worker"</span>, id, <span class="hljs-string">"started job"</span>, job)
      time.Sleep(time.Millisecond * <span class="hljs-number">1500</span>) <span class="hljs-comment">// Simulate work duration</span>
      results&lt;- <span class="hljs-string">"Worker "</span> + strconv.Itoa(id) + <span class="hljs-string">" finished job "</span> + strconv.Itoa(job)
   }
}
</code></pre>
<p>Code Breakdown:</p>
<ol>
<li><p>Channel Initialization:</p>
<ul>
<li><p>Two buffered channels are created: <code>jobs</code> for incoming tasks and <code>results</code> for processed outputs.</p>
</li>
<li><p>Buffered channels prevent blocking, allowing for smoother operation.</p>
</li>
</ul>
</li>
<li><p>Worker Goroutine Deployment:</p>
<ul>
<li><p>A specified number of worker goroutines are launched.</p>
</li>
<li><p>Each worker function takes the <code>jobs</code> channel as input and the <code>results</code> channel as output.</p>
</li>
</ul>
</li>
<li><p>Job Distribution:</p>
<ul>
<li><p>The main function populates the <code>jobs</code> channel with tasks.</p>
</li>
<li><p>After all jobs are queued, the <code>jobs</code> channel is closed to signal completion.</p>
</li>
</ul>
</li>
<li><p>Job Processing:</p>
<ul>
<li><p>Workers continuously pull jobs from the <code>jobs</code> channel using a <code>range</code> loop.</p>
</li>
<li><p>Each job is processed (simulated with a time delay in this example).</p>
</li>
<li><p>Results are sent to the <code>results</code> channel.</p>
</li>
</ul>
</li>
<li><p>Result Collection:</p>
<ul>
<li><p>The main function retrieves and prints results from the <code>results</code> channel.</p>
</li>
<li><p>The program exits after processing all results.</p>
</li>
</ul>
</li>
</ol>
<p>This pattern demonstrates effective concurrent processing, load balancing across multiple workers, and controlled resource utilization. It serves as a foundation for more complex concurrent systems and pipelines in Go.</p>
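<p>The fourth implementation step listed earlier, chaining stages by using one stage's results channel as the next stage's feed, can be sketched as follows. This is an illustrative two-stage pipeline, not from the original article: stage one squares numbers, stage two formats them, and each stage closes its output channel once all of its workers are done, which is what allows the downstream <code>range</code> loops to terminate.</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "sort"
    "sync"
)

// runPipeline wires two worker-pool stages together: stage one
// squares each input, stage two formats each square as a string.
func runPipeline(inputs []int) []string {
    nums := make(chan int)
    squares := make(chan int)
    out := make(chan string)

    // Stage 1: a pool of three workers squaring inputs.
    var wg1 sync.WaitGroup
    for w := 0; w &lt; 3; w++ {
        wg1.Add(1)
        go func() {
            defer wg1.Done()
            for n := range nums {
                squares &lt;- n * n
            }
        }()
    }
    // Close squares only after every stage-1 worker is done,
    // so that stage 2's range loops can terminate.
    go func() { wg1.Wait(); close(squares) }()

    // Stage 2: a pool of two workers formatting results.
    var wg2 sync.WaitGroup
    for w := 0; w &lt; 2; w++ {
        wg2.Add(1)
        go func() {
            defer wg2.Done()
            for s := range squares {
                out &lt;- fmt.Sprintf("square: %d", s)
            }
        }()
    }
    go func() { wg2.Wait(); close(out) }()

    // Feed the pipeline, then drain the final stage.
    go func() {
        for _, n := range inputs {
            nums &lt;- n
        }
        close(nums)
    }()

    var results []string
    for line := range out {
        results = append(results, line)
    }
    sort.Strings(results) // arrival order across workers is nondeterministic
    return results
}

func main() {
    for _, r := range runPipeline([]int{1, 2, 3, 4, 5}) {
        fmt.Println(r)
    }
}
</code></pre>
<p>Closing a channel is the producer's way of broadcasting "no more data" to every consumer ranging over it; the small goroutines that call <code>wg.Wait()</code> and then <code>close(...)</code> are the glue that propagates this signal from stage to stage.</p>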
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we explored how goroutines in Go enable concurrent execution and help optimize the use of computational resources. We covered the various concurrency primitives provided by the Go programming language, including wait groups, channels, buffered channels, and mutexes. We also delved into the use of condition variables and atomic operations.</p>
<p>A key tenet of Go's concurrency philosophy is "Do not communicate by sharing memory; instead, share memory by communicating." We discussed how this principle guides the design of Go's concurrency features and promotes effective coordination between concurrent entities.</p>
<p>Additionally, we examined the select pattern, which allows for handling multiple channels simultaneously, and the worker pool model, a fundamental concurrent design pattern. The worker pool pattern demonstrates efficient load distribution and resource management, making it a valuable tool in building scalable and high-performance concurrent systems.</p>
<p>Through these discussions, I believe this article has significantly expanded your knowledge of Go's concurrency capabilities and equipped you with a stronger set of tools to leverage the power of concurrent programming in your Go projects.</p>
]]></content:encoded></item><item><title><![CDATA[The basics of Async and Parallel Programming]]></title><description><![CDATA[Introduction
Asynchronous programming is an essential part of modern software development, regardless of the programming language you use. It's crucial for optimizing hardware usage and improving application performance. However, in my day-to-day wor...]]></description><link>https://oxyprogrammer.com/the-basics-of-async-and-parallel-programming</link><guid isPermaLink="true">https://oxyprogrammer.com/the-basics-of-async-and-parallel-programming</guid><category><![CDATA[Amdahl’s Law]]></category><category><![CDATA[Gustafson’s Law]]></category><category><![CDATA[green thread]]></category><category><![CDATA[multithreading]]></category><category><![CDATA[multitasking]]></category><category><![CDATA[Threads]]></category><category><![CDATA[Threading]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Fri, 07 Mar 2025 05:45:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735040476529/6d194dfb-8567-46bc-89c6-a7397826dfac.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Asynchronous programming is an essential part of modern software development, regardless of the programming language you use. It's crucial for optimizing hardware usage and improving application performance. However, in my day-to-day work, I've noticed that terms like <code>async</code>, <code>parallel</code>, and <code>concurrency</code> are often used interchangeably, leading to confusion. This article aims to address this issue for three main reasons:</p>
<ol>
<li><p>To explore the history and establish clear distinctions between these various terms.</p>
</li>
<li><p>To incorporate relevant academic concepts from Computer Science, helping to connect the dots.</p>
</li>
<li><p>To create a foundational article that will serve as a basis for future discussions on multithreaded programming.</p>
</li>
</ol>
<p>In this article, we'll examine the buzzwords <code>async</code>, <code>parallel</code> and <code>concurrency</code> clarifying their meanings and how they should be used. We'll also compare how these concepts are implemented across different programming languages, providing a comprehensive overview of these critical programming paradigms.</p>
<h1 id="heading-moores-law-and-a-little-bit-of-history">Moore’s Law and a little bit of History</h1>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore's Law</a>, an empirical observation rather than a physical law, states that the number of transistors in an integrated circuit (IC) doubles approximately every two years. This principle has largely held true since its inception in 1965.</p>
<p>For many years, this increase in transistor count directly translated to higher clock frequencies in single processors, as illustrated in the figure below.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeV-iYxqCvCPIcyiOQahMukbkAgDv6IvfqcO9g_HKFURHEK5bDsTkeJA-3W7-zqVbfRqauRMLVpV-UiNcTntIMiHKEJWCqUEWrXA1v5XaG2JEu5mb_MhJHWT1MQDZzu1QcFaFN9ow?key=mTrrW-u6YrxXONgafrQrHVb9" alt /></p>
<p>(Source: <a target="_blank" href="https://github.com/karlrupp/microprocessor-trend-data">https://github.com/karlrupp/microprocessor-trend-data</a>)</p>
<p>However, as highlighted in the figure, a significant shift occurred around 2008. Chip manufacturers stopped increasing clock speeds, despite the continued growth in transistor count. This change was driven by the realization that higher clock frequencies led to increased heat generation and power consumption, making it challenging to use these chips in smaller computers like laptops. The need for larger cooling systems and reduced battery life made it difficult to manufacture compact, portable devices.</p>
<p>To address these issues while still adhering to Moore's Law, chip makers began increasing the number of logical cores instead of clock speeds. This shift marked a turning point in processor design, focusing on multi-core architectures to boost computational power rather than relying solely on faster single-core processors.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Interestingly, this shift parallels the preference for horizontal scaling over vertical scaling in computing infrastructure, where adding more machines (cores) is often favored over increasing the power of a single machine (core).</div>
</div>

<p>This transition to multi-core processors has had profound implications for software development, necessitating new approaches to take full advantage of the available computational resources.</p>
<h1 id="heading-concurrency">Concurrency</h1>
<p>Concurrency is a broad term often used interchangeably with multitasking or delegation. Let's explore this concept using an everyday example.</p>
<p>Imagine a morning schedule with the following tasks:</p>
<ol>
<li><p>Wash dishes (30 minutes)</p>
</li>
<li><p>Do laundry (30 minutes)</p>
</li>
<li><p>Cook lunch (60 minutes)</p>
</li>
<li><p>Shower and prayer (30 minutes)</p>
</li>
</ol>
<p>Completing these tasks sequentially would take 30 + 30 + 60 + 30 = 150 minutes or 2 hours and 30 minutes.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe6srCyTVHga4lOCPMhIt2NQEB2oPfa-xuGHd3DTnXs2rUpflSP20erk06RDLYkGLivkIyByKVT8A1YSDKCjQcHWsvzwEJRQ5MlHt6Zf0SoZgakq4iUAqheKhcuQ0MH48U_lpqA?key=mTrrW-u6YrxXONgafrQrHVb9" alt /></p>
<h2 id="heading-async-multitasking">Async / Multitasking</h2>
<p>In our example, we can observe opportunities for multitasking:</p>
<ul>
<li><p><strong>Dishwashing</strong>: 15 minutes to load/unload, 15 minutes of independent machine operation</p>
</li>
<li><p><strong>Laundry</strong>: 15 minutes to load/unload, 15 minutes of independent machine operation</p>
</li>
<li><p><strong>Cooking lunch</strong>: Requires full engagement</p>
</li>
<li><p><strong>Showering and prayer</strong>: Requires full engagement</p>
</li>
</ul>
<p>A more efficient approach would be:</p>
<ol>
<li><p>Load the dishwasher</p>
</li>
<li><p>While dishes are washing, load the laundry</p>
</li>
<li><p>Start cooking lunch</p>
</li>
<li><p>Respond to signals from the dishwasher and washing machine to unload when ready</p>
</li>
<li><p>Resume cooking</p>
</li>
<li><p>After cooking, shower and pray</p>
</li>
</ol>
<p>This multitasking approach saves about 30 minutes from the original schedule.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdNEFt2yJAVtHsCoXP5YWGPSlKS83LDfyHNYCeqe4SUj1f4f27qDrEXt3_62KRYVfeB_Pu9PdL4lfyKfZ0jXK-YlqtPJIpc3hdil0n6paKR4kD5BySDMWJpUrOUaA7kpZL-SOprsw?key=mTrrW-u6YrxXONgafrQrHVb9" alt /></p>
<h2 id="heading-multi-processing-parallelism">Multi Processing/ Parallelism</h2>
<p>To save even more time, we could outsource lunch preparation to a nearby eatery that delivers home-cooked meals. This parallel processing reduces the total time to 1 hour and 15 minutes—a 50% reduction from the original schedule.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc0eA6FLZbXvRBZpvALT8j9O8obooAQgUphgvU0FEqqTbw6z05YciI73qNYvJuMnP6AUng_MIoqDy3GVPmRn3KJMYQfolyJ-4FCqsRwD8wDz-4D4CFzeg_A7shRW0yPXpbzZ1-Msw?key=mTrrW-u6YrxXONgafrQrHVb9" alt /></p>
<p>Key Points:</p>
<ol>
<li><p>The 50% time reduction was achieved through both multitasking (async) and multi-processing (parallelism).</p>
</li>
<li><p>The entire optimized morning routine is an example of concurrent operations.</p>
</li>
<li><p>Both async and parallelism are subsets of concurrency.</p>
</li>
</ol>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdtUf-HwQeXopso7pLjz2lIYy7WUsT1_hUf2Lm3Yftezg_xq6QGWwnl1DhXouhOWL10d-VPHOvskOmLcK3gmZ9yfF_L0mPftG9lhabfuxoWvufxgOputsZJvZV0HFcUy3CFzouVEw?key=mTrrW-u6YrxXONgafrQrHVb9" alt /></p>
<p>Correlating above example with programming</p>
<p><strong>Asynchronous Operations:</strong></p>
<ol>
<li>IO operations (e.g., reading from disk or socket) are analogous to doing dishes and laundry. When the processor initiates a disk read, it delegates to a disk controller and can perform other tasks while waiting. Once the operation completes, an interrupt (similar to a machine's signal) notifies the processor that the data is ready.</li>
</ol>
<p><strong>Parallel Processing:</strong></p>
<ol start="2">
<li>CPU-intensive calculations benefit from additional processors, much like outsourcing lunch preparation. For example, a desktop app might perform intensive calculations on a background thread using a separate processor, while the main thread keeps the app responsive on another processor.</li>
</ol>
<p>This real-world analogy helps illustrate how async operations, parallel processing, and concurrency work together in modern computing to optimize resource usage and improve overall efficiency.</p>
<h1 id="heading-limitations-of-concurrency">Limitations of Concurrency</h1>
<p>We've observed that concurrency encompasses both multitasking and parallelism, achieved through:</p>
<ol>
<li><p>Multitasking on a single processor</p>
</li>
<li><p>Parallel computation on multiple processors</p>
</li>
</ol>
<p>While this may seem straightforward at first glance, it's actually quite complex. Let's delve deeper into our morning chores example to illustrate some key challenges:</p>
<p>Synchronization Requirements:</p>
<p>Even in our simplified scenario, there are frequent occasions that require synchronization:</p>
<ul>
<li><p>The dishwasher and washing machine need attention when they signal completion.</p>
</li>
<li><p>Lunch delivery requires a response when the doorbell rings.</p>
</li>
</ul>
<p>These synchronization points highlight the need for careful coordination in concurrent systems.</p>
<p>Limits of Parallelization:</p>
<p>There's a limit to how much parallelization can optimize a process. For instance, if we added two more helpers to our morning routine:</p>
<p><strong>Helper 1</strong>: Load dishes, wait, unload dishes</p>
<p><strong>Helper 2</strong>: Load laundry, wait, unload laundry</p>
<p><strong>Eatery</strong>: Cook and deliver lunch</p>
<p><strong>Self</strong>: Shower and pray</p>
<p>In this scenario, we would only save an additional 30 minutes. The entire process can't be shortened beyond 1 hour because cooking and delivering lunch remains the longest task. Notably, Helper 1 and Helper 2 spend a significant amount of time waiting idly by their respective machines.</p>
<p>Adding more helpers beyond this point would be inefficient, as they would have no tasks to perform and would sit idle.</p>
<h2 id="heading-amdahls-law">Amdahl’s Law</h2>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's Law</a> focuses on the potential speedup of a program when part of it is improved or parallelized.</p>
<p><strong>Speedup</strong>:</p>
<p>$$\frac{1}{(1 - P) + \frac{P}{N}}$$</p><p><strong>Where:</strong></p>
<p>P = Portion of the program that can be parallelized</p>
<p>N = Number of processors</p>
<p>(1 - P) = Portion that remains sequential</p>
<hr />
<p>This law shows that the overall speedup is limited by the sequential part of the program. As N increases, the speedup approaches a maximum limit of 1 / (1 - P).</p>
<h2 id="heading-gustafsons-law">Gustafson’s Law</h2>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Gustafson%27s_law#:~:text=Gustafson's%20law%20argues%20that%20a,and%20functions%20of%20the%20system.">Gustafson's Law</a> considers that as we get more computing resources, we tend to take on larger problems rather than just solving the same problem faster.</p>
<p><strong>Scaled speedup</strong>:</p>
<p>$$N + (1 - N) \cdot s$$</p><p><strong>Where</strong>:</p>
<p>N = Number of processors</p>
<p>s = Serial fraction of the program</p>
<hr />
<p>This law suggests that the speedup can scale roughly linearly with the number of processors for many real-world problems.</p>
<p>Imagine a financial reporting system that needs to gather data from various financial data providers, process the data, and then combine the results into a comprehensive report.</p>
<p><strong>Amdahl's Law</strong> would apply to the part of the program that fetches the data from the independent sources (e.g., stock prices, exchange rates, economic indicators). By parallelizing this data fetch, the program can speed up this part of the process.</p>
<p>Gustafson's Law would come into play as the program is able to handle more data sources and generate more comprehensive reports as the computing resources (processors) are increased. The program can then take on larger and more complex financial reporting tasks.</p>
<p>By understanding both Amdahl's Law and Gustafson's Law, the developers of the financial reporting system can optimize the program's performance and scalability, ensuring that it can efficiently handle increasing amounts of data and reporting needs.</p>
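<p>To make the two formulas concrete, here is a small numeric sketch; the 90%-parallel workload is an assumed example, not a measurement. With P = 0.9 and N = 8, Amdahl's Law gives 1 / (0.1 + 0.9/8) ≈ 4.71, and no number of processors can push the speedup past 1 / (1 - 0.9) = 10. Gustafson's Law with s = 0.1 and N = 8 gives 8 + (1 - 8) · 0.1 = 7.3, and the scaled speedup keeps growing with N:</p>
<pre><code class="lang-go">package main

import "fmt"

// amdahl returns the speedup 1 / ((1-p) + p/n) for a program whose
// parallelizable fraction is p, run on n processors.
func amdahl(p, n float64) float64 {
    return 1 / ((1 - p) + p/n)
}

// gustafson returns the scaled speedup n + (1-n)*s for serial fraction s
// on n processors.
func gustafson(s, n float64) float64 {
    return n + (1-n)*s
}

func main() {
    // Assumed workload: 90% parallelizable, i.e. serial fraction 0.1.
    for _, n := range []float64{2, 8, 64, 1024} {
        fmt.Printf("N=%4.0f  Amdahl: %5.2f  Gustafson: %7.1f\n",
            n, amdahl(0.9, n), gustafson(0.1, n))
    }
    // Amdahl plateaus below 1/(1-0.9) = 10; Gustafson scales with N.
}
</code></pre>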
<h1 id="heading-green-thread-managed-thread-user-level-thread">Green Thread, Managed Thread, User Level Thread</h1>
<blockquote>
<p>Q : Why are green threads called green threads?</p>
<p><em>A: Because they are green</em> 😛</p>
<p><em>Q: Isn’t the processor color blind?</em> 🤔</p>
</blockquote>
<p>A thread is the basic unit of execution that can run on a processor. Computers with multiple processors can run as many threads as there are CPUs or processors.</p>
<p>A processor takes a thread from the ready queue, processes it for a short duration, and then switches to the next thread. This rapid switching between threads from the ready queue creates the illusion of parallel execution, even on a single-processor computer.</p>
<p>There are several reasons that cause a processor to change context and pick a new thread. These include time slice expiration, preemption, and priority-based scheduling. However, one major reason is I/O operations. If the processor detects that a thread is waiting for an I/O operation, it will immediately release that thread and pick another one from the ready queue.</p>
<p>This approach works well for a finite number of threads, but as the number of threads grows very high, the time spent on context switching can increase drastically.</p>
<p>To address this issue, modern programming runtimes like Golang's and C#'s have introduced the concept of <code>green threads</code> or <code>managed threads</code>. When you create a unit of concurrency in these runtimes (a goroutine in Go, a task in C#), it is not an actual OS-level thread (also known as a kernel thread) that is created. Instead, it is a green thread or managed thread, which is a user-level thread.</p>
<p>The runtime creates the actual OS-level threads and then assigns the green/managed threads to these OS threads. In this way, a single OS thread can run multiple green/managed threads within it.</p>
<p>When a green/managed thread blocks on an I/O operation, the runtime can hand the remaining green/managed threads over to another OS thread, creating one if necessary, so they keep running. This multiplexing of many green/managed threads onto a few OS threads reduces the overhead of context switching, which becomes increasingly important as the number of threads grows.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf7eo7lSB3KDgZ02y-XsPsnzGnRC7eNZNdP9oOpbSukwPYV_r96s--SPqFryEDNpaw6KeQVWA69g9f0TYO76AevRnXv4oYFmJweNY7sNKWXj43FGuQKK4p53FzCFLVmYMN5iCm_Fw?key=mTrrW-u6YrxXONgafrQrHVb9" alt /></p>
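<p>The cheapness of green threads is easy to observe in Go. The sketch below (illustrative; exact numbers vary by machine) parks 10,000 goroutines on a single channel and then asks the runtime how many logical processors back them; a handful of OS threads ends up multiplexing all of them:</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    block := make(chan struct{})

    // Launch 10,000 goroutines (green threads) that all park on a channel.
    // Each costs a few kilobytes of stack, not a dedicated OS thread.
    for i := 0; i &lt; 10000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            &lt;-block // parked until the channel is closed
        }()
    }

    fmt.Println("goroutines alive:", runtime.NumGoroutine()) // about 10001
    fmt.Println("logical CPUs (GOMAXPROCS):", runtime.GOMAXPROCS(0))

    close(block) // release every goroutine
    wg.Wait()
}
</code></pre>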
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we explored the evolution of computing systems and the importance of concurrency to improve computational performance beyond the limitations of increasing clock frequencies on single processors.</p>
<p>We learned that multitasking and multiprocessing are distinct but related concepts that fall under the broader umbrella of concurrency. While JavaScript, a single-threaded language, can support multitasking through techniques like event-driven programming, modern languages like C# and Golang offer robust support for multiprocessing by leveraging parallel operations across multiple CPUs.</p>
<p>A key concept we discussed was the difference between OS-level threads (also known as kernel threads) and user-level threads, often referred to as <code>green threads</code> or <code>managed threads</code>. This distinction is important, as it allows programming runtimes to optimize thread management and reduce the overhead of context switching, particularly as the number of threads grows.</p>
<p>Whether you are a seasoned developer or new to the world of concurrency, I hope this article has provided clarity and added valuable knowledge to your understanding of these fundamental computer science concepts. Even if you were already familiar with the topics covered, I trust that this article has helped solidify your grasp of the subject matter and perhaps even challenged your previous assumptions.</p>
<p>Thank you for taking the time to read this article. I appreciate your engagement and hope that the insights shared here will prove useful in your future endeavors.</p>
]]></content:encoded></item><item><title><![CDATA[The internals of TCP: A deep dive]]></title><description><![CDATA[Introduction
TCP has been the foundational protocol for numerous higher-level protocols, such as HTTP and WebSockets, due to its guarantee of data integrity. Imagine querying a database table and missing a few rows of data—that would be catastrophic....]]></description><link>https://oxyprogrammer.com/the-internals-of-tcp-a-deep-dive</link><guid isPermaLink="true">https://oxyprogrammer.com/the-internals-of-tcp-a-deep-dive</guid><category><![CDATA[TCP]]></category><category><![CDATA[TCP Handshake]]></category><category><![CDATA[TCP/IP]]></category><category><![CDATA[networking]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Fri, 07 Mar 2025 05:40:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747489830768/e0701e04-a65a-4f66-91e5-4286c4c9cb73.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>TCP has been the foundational protocol for numerous higher-level protocols, such as HTTP and WebSockets, due to its guarantee of data integrity. Imagine querying a database table and missing a few rows of data—that would be catastrophic. Yet, we make database calls without concern about missing data, thanks to TCP's integrity guarantees. In networking, balancing data integrity with latency is a challenge, especially given the unpredictable nature of the internet. We never know how many routers our network calls hop through, and despite many routers or switches potentially dropping packets, data loss is not an issue.</p>
<p>In the <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp">first article</a> of this networking series, we briefly discussed TCP connections, examining how the <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp#heading-os-facilitating-tcp-connection-under-the-hood">OS handles them internally</a>. We concluded with several open-ended questions. I recommend reviewing <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp">that article</a>, as it provides an essential piece of the puzzle we aim to complete.</p>
<p>This article, I believe, will clarify your understanding of TCP and dispel misconceptions about this enduring protocol, which has stood the test of time for over four decades.</p>
<h1 id="heading-ip-packet-header">IP Packet Header</h1>
<p>In our <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp">previous discussion</a>, we explored how IP packets exist at Layer 3 of the OSI model, while TCP segments operate at Layer 4. A TCP segment, together with its header, is encapsulated in the data payload of an IP packet before being sent.</p>
<p>To fully understand the subsequent sections, it is important to familiarize ourselves with the IP packet header. Below is a diagram illustrating the IP packet header structure:</p>
<pre><code class="lang-plaintext">    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
</code></pre>
<p>Source: <a target="_blank" href="https://datatracker.ietf.org/doc/html/rfc791#page-11">RFC 791</a></p>
<ul>
<li><p><strong>Version</strong>: Specifies the IP protocol version, either IPv4 or IPv6.</p>
</li>
<li><p><strong>IHL (Internet Header Length)</strong>: Indicates the length of the header, including any options.</p>
</li>
<li><p><strong>Type of Service</strong>: Defines aspects like priority and quality of service.</p>
</li>
<li><p><strong>Total Length</strong>: Specifies the packet's total length, including the header and data.</p>
</li>
<li><p><strong>Identification, Flags, Fragment Offset</strong>: These fields are used for packet fragmentation and reassembly if needed.</p>
</li>
<li><p><strong>Time to Live (TTL)</strong>: Limits the lifespan of a packet by defining the maximum number of hops it can make before being discarded. Each router decrements this value by one.</p>
</li>
<li><p><strong>Protocol</strong>: Specifies the protocol used in the data portion (e.g., ICMP, TCP, UDP).</p>
</li>
<li><p><strong>Header Checksum</strong>: Used to verify the integrity of the header data in IPv4 packets.</p>
</li>
<li><p><strong>Source and Destination IP Address</strong>: Specifies the sender's and receiver's IP addresses. At Layer 3, only these IP addresses are involved in routing.</p>
</li>
</ul>
<p>Understanding these fields will aid in navigating the complexities of network packet handling in further discussions.</p>
<h1 id="heading-tcp-connection">TCP Connection</h1>
<p>The Transmission Control Protocol (TCP) focuses on controlling data transmission, unlike the more lenient UDP. TCP is methodical about initiating, maintaining, and terminating data transmission. A crucial aspect is the TCP connection.</p>
<p>A TCP connection is bidirectional, enabling protocols like HTTP and WebSocket to operate effectively. It is initiated by the client and accepted by the server. A client is typically an application used directly by the user to initiate operations, while a server is a continuously available application that handles client requests.</p>
<p>Therefore, it is more practical for clients to identify the server rather than vice versa; this convention reflects software architecture, not any inherent directionality in TCP.</p>
<p>TCP, a Layer 4 protocol with port visibility (see the <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp">first article</a>), requires an established connection for data transmission. A TCP connection resembles an agreement between a client and server, identified by:</p>
<ul>
<li><p>Client IP</p>
</li>
<li><p>Client Port</p>
</li>
<li><p>Destination IP</p>
</li>
<li><p>Destination Port</p>
</li>
</ul>
<p>Establishing a TCP connection involves a three-way handshake: SYN, SYN-ACK, and ACK. <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp#heading-traveling-through-the-layers-of-network">TCP segments</a> are sequenced and acknowledged. A delay in segment acknowledgment triggers a retransmission, which we will explore shortly.</p>
<p>Multiple connections can exist between the same client and server, with TCP segments multiplexed into a single stream. This stream is then demultiplexed and routed to the appropriate programs listening on relevant ports.</p>
<h2 id="heading-connection-establishing">Connection Establishing</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735036838686/a608f912-0c32-4fcd-96a6-65310e0399cf.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-tcp-connection-establishment-consists-of-three-way-handshake">TCP connection establishment consists of a three-way handshake:</h2>
<ol>
<li><p>The sender sends a SYN request.</p>
</li>
<li><p>The receiver responds with a SYN/ACK message.</p>
</li>
<li><p>The initiator sends back an ACK message, finalizing the connection. Both sender and receiver now have sockets and file descriptors, signifying an established connection.</p>
</li>
</ol>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><em>For information on File Descriptors and Sockets, </em><a target="_self" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp#heading-os-facilitating-tcp-connection-under-the-hood"><em>read this article</em></a><em>.</em></div>
</div>

<h2 id="heading-transmission-of-data">Transmission of Data</h2>
<p>With the connection established, the sender begins transmitting TCP segments. These segments are acknowledged by the receiver upon receipt.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfVhtArdMgZfcwYWYYgRizU2Dr9gkayhDSsyFkm2-UN2qGD-cP9u0noFdgrUbRV97AgP0l-qwr0twX6tjgvRkKPwAS3ybFp-hwlHoA5Fe0vQHZlTlJYMqdV4h6gW_r1F4IqwWx4rw?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>The receiver may acknowledge multiple segments with a single acknowledgment. For instance, in the given scenario (refer the diagram above), the sender might send three segments numbered based on the initial SYN connection request sequence. The receiver acknowledges the last segment, implicitly acknowledging prior segments as well.</p>
<h2 id="heading-re-transmission-of-data">Retransmission of Data</h2>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXepcpcuFr8YgJTy2q6E4zb7JAkuof0q10y33A3aNE55muNTlqTpyBPW64SxB83kfwnJWZq0iscIOXNyibwCBz3siGGU67xC7PXaGK6g2c3kQw6hBsb2BVQORymWgnA35WvV0bPn5g?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>For illustration:</p>
<ol>
<li><p>The segment with the third sequence number is dropped.</p>
</li>
<li><p>The sender waits for acknowledgment and receives one only for sequence 2.</p>
</li>
<li><p>After a timeout, the sender retransmits the segment marked with sequence 3.</p>
</li>
<li><p>The receiver then acknowledges sequence 3.</p>
</li>
</ol>
<p>You might be wondering what would happen if sequence 2 got dropped while sequence 3 went through. That would result in <strong>Head-of-Line blocking</strong>, which we will explore in the last section of the article.</p>
<h2 id="heading-connection-closure-and-connection-states">Connection Closure and connection States</h2>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd7t0R1GIUF7KlWEge3fb-RikXfUbVu1oJc0A5msLkldC-1mSoLiRHYyxUb7r1i11uG2U5yYhDiMVSvPJc7PiycF9dstKkWcXZlfBEITbG_xsmOZpYO2dcsC5NR_9RSiQmWGEegig?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>While a TCP connection is established via a three-way handshake, closing it involves a four-way handshake. The connection state transitions as follows:</p>
<ol>
<li><p>The sender initiates closure by sending a FIN request, entering the FIN WAIT state.</p>
</li>
<li><p>The receiver receives the FIN, sends back an ACK, and moves to the CLOSE WAIT state.</p>
</li>
<li><p>The sender receives the ACK, moving to the FIN WAIT 2 state.</p>
</li>
<li><p>The receiver enters the LAST ACK state, sending a FIN back.</p>
</li>
<li><p>The sender, on receiving the FIN, moves to the TIME WAIT state, sending a final ACK.</p>
</li>
<li><p>On receiving the last ACK, the receiver transitions the connection to the CLOSED state. The sender, meanwhile, remains in TIME WAIT (usually around four minutes, twice the Maximum Segment Lifetime) to ensure no stray segments are still in flight before it finally moves to CLOSED as well.</p>
</li>
</ol>
<p>The responsibility for waiting in TIME WAIT falls on the initiator of the closure, hence the recommendation that the client initiate the connection. Additionally, the disposal of sockets and file descriptors can continue even after the connection is closed, as the OS manages resource cleanup independently.</p>
<h1 id="heading-anatomy-of-tcp-segment">Anatomy of TCP Segment</h1>
<p>The TCP header format is as follows:</p>
<p><em>Note: Each tick mark represents one bit position.</em></p>
<pre><code class="lang-plaintext">0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window  Size       |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
</code></pre>
<p>Source: <a target="_blank" href="https://datatracker.ietf.org/doc/html/rfc793#page-15">RFC 793</a></p>
<p>The mandatory portion of the header is 5 × 4 = 20 bytes (the Data Offset field counts the header length in 32-bit words). <a target="_blank" href="https://datatracker.ietf.org/doc/html/rfc793#page-15">Additional headers</a> may be included as options. The Source Port and Destination Port fields specify the ports for the source and destination (the IP addresses are contained in the IP packet header).</p>
<p>The Acknowledgment Number is only relevant if the ACK flag is set. <a target="_blank" href="https://datatracker.ietf.org/doc/html/rfc793#page-15">The Window</a> Size field indicates the amount of data the receiver can handle (more on this in the Sliding Window section).</p>
<p>Notable are the 9-bit flags:</p>
<ul>
<li><p><strong>FIN</strong>: Indicates a connection closure request.</p>
</li>
<li><p><strong>SYN</strong>: Synchronizes sequence numbers to initiate a connection.</p>
</li>
<li><p><strong>RST</strong>: Resets the connection.</p>
</li>
<li><p><strong>ACK</strong>: Acknowledges received data.</p>
</li>
<li><p><strong>URG</strong>: Marks urgent data.</p>
</li>
<li><p><strong>ECE</strong>: Signals congestion notification.</p>
</li>
<li><p><strong>CWR</strong>: Indicates congestion window reduction.</p>
</li>
<li><p><strong>NS</strong>: ECN-nonce, an experimental flag (RFC 3540).</p>
</li>
</ul>
<p>The relevance of these flags will become clearer as we explore more aspects of TCP.</p>
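<p>As a quick illustration, the flag bits live in bytes 12 and 13 of the TCP header. The Go sketch below decodes them from a fabricated header whose flags byte is <code>0x12</code> (SYN + ACK), as in the second step of the three-way handshake.</p>

```go
package main

import "fmt"

// tcpFlags decodes the flag bits from bytes 12-13 of a TCP header,
// following the RFC 793 layout (plus the NS bit from RFC 3540).
func tcpFlags(hdr []byte) map[string]bool {
	b := hdr[13]
	return map[string]bool{
		"NS":  hdr[12]&0x01 != 0, // experimental ECN-nonce bit
		"CWR": b&0x80 != 0,
		"ECE": b&0x40 != 0,
		"URG": b&0x20 != 0,
		"ACK": b&0x10 != 0,
		"PSH": b&0x08 != 0,
		"RST": b&0x04 != 0,
		"SYN": b&0x02 != 0,
		"FIN": b&0x01 != 0,
	}
}

func main() {
	// A fabricated 20-byte header: data offset nibble set to 5 words,
	// flags byte set to ACK (0x10) + SYN (0x02) = 0x12.
	hdr := make([]byte, 20)
	hdr[12] = 0x50
	hdr[13] = 0x12
	f := tcpFlags(hdr)
	fmt.Println("SYN:", f["SYN"], "ACK:", f["ACK"], "FIN:", f["FIN"])
	// prints: SYN: true ACK: true FIN: false
}
```
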
<h1 id="heading-flow-control">Flow Control</h1>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc2tCgSIqzSO7zCkAGrLo4dWZd4FBPeZxpyc6OJo0zCRMe6meSTFizFQIcWijg_uPAqLt8SUWljPaw52N7YeDcXqI4EsXBMF-P2B1ycnOJXhJOVfbYcvp16z-3irYISxUFMJhg1PA?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>Consider a scenario where the sender wants to transmit Segments 1, 2, and 3 to the receiver. Receiving a single acknowledgment for multiple segments is more efficient. However, the sender must have a way to know how many segments to send before waiting for an acknowledgment. Sending too many segments could overwhelm the receiver's buffer, leading to dropped segments.</p>
<p>This is where the Window Size field (refer to TCP segment anatomy) comes into play. Each acknowledgment from the receiver includes the current Window Size, informing the sender of how many packets can be sent.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">While the Window Size is significant, there are other factors influencing flow control that we will discuss in upcoming sections.</div>
</div>

<h1 id="heading-receiver-sliding-window">Receiver (Sliding) Window</h1>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeVIDjPEXTIxwPQ0QyFJ2d5mep2VLdN5nR-1BvXFN-xZj85A_k5Io_U1EKzYdUBh0H04PpJ1CHJbOVtVdcH8J8oMOD7Q26z8IjniPNjWEERsCgg4KSEOFxQZ2i29dzqmq5SbHVDNg?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>The sliding window is a critical mechanism in TCP used by the sender. As illustrated, assume the sender wants to transmit six segments to the receiver.</p>
<ol>
<li><p>The sender transmits Segments 1, 2, and 3.</p>
</li>
<li><p>The receiver acknowledges Segment 2.</p>
</li>
<li><p>The sender realizes Segment 3 is either still in transit or lost.</p>
</li>
<li><p>The window slides to include Segments 4 and 5, keeping Segment 3 within the window.</p>
</li>
<li><p>Segments 1 and 2, having been acknowledged, slide out of the sender's window and can be discarded from its buffer.</p>
</li>
<li><p>The sender transmits Segments 4 and 5.</p>
</li>
<li><p>Upon receiving acknowledgment for Segment 3, the window slides further to include Segment 6 and remove Segment 3.</p>
</li>
</ol>
<p>This sliding window mechanism ensures orderly data transmission and efficient use of network resources.</p>
<p><strong>But what should be the size of this window?</strong></p>
<p>The window must cover the segments that are in flight or about to be sent. Its size is updated with every acknowledgment that comes back (remember the Window Size field in the TCP header?).</p>
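<p>The sliding-window bookkeeping described above can be sketched in a few lines of Go. This toy model tracks the oldest unacknowledged segment (<code>base</code>) and the next segment to send; the segment count and window size are made-up values for illustration.</p>

```go
package main

import "fmt"

func main() {
	const total = 6 // segments to deliver (illustrative)
	window := 3     // window size advertised by the receiver
	base, next := 1, 1

	// send transmits every segment that fits inside the current window.
	send := func() {
		for next < base+window && next <= total {
			fmt.Println("send segment", next)
			next++
		}
	}
	// ack slides the window past everything up to segment n.
	ack := func(n int) {
		fmt.Println("ack for segment", n)
		if n >= base {
			base = n + 1
		}
		send()
	}

	send() // segments 1, 2, 3 go out
	ack(2) // 1 and 2 leave the window; 4 and 5 can now be sent
	ack(3) // 3 is finally acknowledged; segment 6 goes out
}
```

<p>Note how the acknowledgment for segment 2 lets segments 4 and 5 enter the window while segment 3 stays inside it, mirroring the diagram above.</p>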
<h1 id="heading-congestion-control">Congestion Control</h1>
<p>The flow of data from sender to receiver in TCP is not solely managed by flow control. Although flow control ensures that the receiver is not overwhelmed, the data must traverse several intermediary devices such as routers and switches. These network elements might not support rapid data flow even if the receiver can handle it. Therefore, TCP also incorporates congestion control to manage data transmission effectively.</p>
<p>In addition to the Receiver Window (RWND), TCP utilizes a Congestion Window (CWND), which plays a crucial role in congestion control. It is important to note that the sender's effective window is the smaller of the CWND and the RWND, so the CWND in effect never exceeds the RWND.</p>
<p>There are two primary algorithms for determining the size of the CWND:</p>
<ol>
<li><p><strong>TCP Slow Start</strong>: This algorithm gradually increases the CWND size to identify the network's capacity without causing congestion.</p>
</li>
<li><p><strong>Congestion Avoidance</strong>: This algorithm aims to optimize data flow by adjusting the CWND size to avoid congestion once the initial network capacity has been identified.</p>
</li>
</ol>
<p>Let's explore each of these algorithms in detail.</p>
<h2 id="heading-tcp-slow-start">TCP Slow Start</h2>
<p>Ironically, despite its name, TCP Slow Start is actually the fastest of the congestion control algorithms, roughly doubling the CWND every round trip.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe-_OUVd-jkJbTKjQ5uBNy6vhd3BE6Na9_UcRZMBgE4I_vGNDVhmjSrX8vys2DB-kmo_-eqSNd_eGk_DypXxJPtpFSdlbcLXSc3C5r_dFY86HNFrXVp8QfyKHLaAOboRHCEcd7u?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>In the example illustrated above:</p>
<ol>
<li><p>The Congestion Window (CWND) begins with a capacity of 1 segment. Accordingly, only one segment is sent initially.</p>
</li>
<li><p>Upon receiving acknowledgment from the receiver, the CWND is increased by 1.</p>
</li>
<li><p>The sender then transmits two segments: Segments 2 and 3.</p>
</li>
<li><p>In response, the receiver sends back acknowledgments for both segments (2 and 3). As a result, the CWND increases by 2 (1 for each acknowledgment received).</p>
</li>
<li><p>With the CWND now allowing for larger transmission, the sender proceeds to send Segments 4, 5, 6, and 7.</p>
</li>
</ol>
<h2 id="heading-congestion-avoidance">Congestion Avoidance</h2>
<p>The Congestion Avoidance algorithm also increases the CWND, but at a slower rate than TCP Slow Start.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc6IhmnSLsQeVlDxg7BWg7OpPv78JiW6Fqf0N8q2oKnKL3rvM4mkQ527m9xD-TpcDpD_Qd2KacK_FvSo3eMHi9X8aIk_Pz8-Wd9h_q7LM8L505TWJvCJFdNSvNLD935UMX3vdofiw?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>In the above diagram example:</p>
<ol>
<li><p>The Congestion Window (CWND) begins with the capacity for 1 segment, so only one segment is initially sent.</p>
</li>
<li><p>Upon receiving an acknowledgment from the receiver, the CWND is increased by 1.</p>
</li>
<li><p>The sender then transmits two segments: Segments 2 and 3.</p>
</li>
<li><p>The receiver sends back acknowledgments for both segments. The CWND is increased by 1 for this entire round trip, as it pertains to the single window of data sent.</p>
</li>
<li><p>With the updated CWND, the sender now sends Segments 4, 5, and 6.</p>
</li>
</ol>
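<p>The difference between the two growth patterns is easy to see in a simulation. The Go sketch below grows the CWND once per round trip: doubling during slow start, then adding one segment per RTT after a threshold is crossed. The threshold (<code>ssthresh</code>) and RTT count are made-up values, not tuned to any real TCP stack.</p>

```go
package main

import "fmt"

func main() {
	cwnd, ssthresh := 1, 8 // in segments; illustrative numbers
	for rtt := 1; rtt <= 6; rtt++ {
		phase := "slow start"
		if cwnd >= ssthresh {
			phase = "congestion avoidance"
		}
		fmt.Printf("RTT %d: cwnd=%2d segments (%s)\n", rtt, cwnd, phase)
		if cwnd < ssthresh {
			cwnd *= 2 // slow start: +1 segment per ACK, so it doubles each RTT
		} else {
			cwnd++ // congestion avoidance: +1 segment per RTT
		}
	}
}
```

<p>The window climbs 1, 2, 4, 8 during slow start, then creeps to 9 and 10 under congestion avoidance: exponential growth to find capacity quickly, linear growth to probe carefully once near it.</p>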
<h1 id="heading-congestion-notification">Congestion Notification</h1>
<p>Routers operate at Layer 3 of the OSI model, giving them visibility into IP packets. IP packets include a field known as ECN (Explicit Congestion Notification). When a router detects that its buffers are becoming full, it marks the ECN field in passing packets and forwards them to the receiver. The receiver notes the ECN marking in the IP header and echoes it back to the sender (using the ECE flag in the TCP header). The sender then reduces its transmission rate to alleviate congestion.</p>
<h1 id="heading-nagles-algorithm">Nagle’s Algorithm</h1>
<p>Nagle’s algorithm specifies that if there are in-flight segments—meaning segments that have been sent but for which an acknowledgment (ACK) has not yet been received—an IP packet will only be transmitted if it is completely filled. Conversely, if there are no in-flight segments, a packet will be sent even if it is only partially filled.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfI9AzDg889hvsguUc6GZ4IUR61iuXovpIY72OK3626xxLKMk-zYR4p0ip1PE3UgzxwnDhN-M5GrImcteUb3pOuL04KxwC9JGc5U_oysNb-50D6upIJPyzGOrW1rQXUs9OnH4QrLw?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>In the diagram example provided:</p>
<ul>
<li><p>Three completely filled IP packets are sent, while a partially filled packet is held back.</p>
</li>
<li><p>Once the acknowledgments for the three packets are received, the partially filled packet is then transmitted.</p>
</li>
</ul>
<p>Nagle’s algorithm is often disabled in modern networking practices because it can introduce additional latency.</p>
<h1 id="heading-delayed-acknowledgement-algorithm">Delayed Acknowledgement Algorithm</h1>
<p>The Delayed Acknowledgment Algorithm is implemented on the receiver's side. This algorithm suggests that the receiver should wait to receive multiple packets before sending an acknowledgment (ACK). By doing so, the number of acknowledgments sent is reduced, cutting network overhead, at the cost of slightly delaying the acknowledgments themselves.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfiuNi68X7lZ0thL7iEl3e_-A1CmM9egQMJYG-yOLegRBxk_u-8buxG4DQLSnKTLvJMeXbUfCEah0GF2QvAybSFXC3LCfOyPKrBBA6e39xZ_Qxthvi25MXrEtYoTK5Po2L-palL?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>A problem arises when both Nagle’s algorithm and the Delayed Acknowledgment Algorithm are used simultaneously. Nagle’s algorithm, which operates on the sender's side, holds back a packet until an acknowledgment (ACK) for previously sent segments is received. Meanwhile, the Delayed Acknowledgment Algorithm, implemented on the receiver's side, waits to receive multiple segments before sending back an ACK.</p>
<p>This combination can create a deadlock-like situation, where the sender is waiting for an ACK that is not being sent because the receiver is holding off until it receives additional segments. As a result, this can lead to retransmissions and timeouts, negatively impacting network performance.</p>
<h1 id="heading-tcp-head-of-line-blocking">TCP Head-of-Line Blocking</h1>
<p>TCP ensures that segments are delivered in the order they are sent. Head-of-Line blocking occurs when a segment in the middle of a sequence gets dropped; it particularly impacts HTTP requests because multiple requests often share the same connection.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXduJXUlbZFCSo1pqqh4UhPzzjo2LFZEEQMhDdec4bVeT3hxDY2fuQoK-vL8xi42lHnRTCsQSbTFxxrPCqMg_A48rS05ECE3gmgAdsKoMdCyHM1M3ckMhKksKendaRIeEkuYrI_soA?key=allS53MX6v5H-VzZb3VLXJ-m" alt /></p>
<p>Consider the diagram example illustrated above:</p>
<ul>
<li><p>Request 1 is divided into Segments 1 and 2.</p>
</li>
<li><p>Request 2 is divided into Segments 3 and 4.</p>
</li>
</ul>
<p>If Segments 2, 3, and 4 are successfully transmitted but Segment 1 is dropped, the server will not send acknowledgments (ACKs) for Segments 2, 3, and 4 until Segment 1 is retransmitted.</p>
<p>As a result, Request 2 suffers delays even though it was fully transmitted, because it is dependent on the status of Request 1.</p>
<p>This scenario exemplifies Head-of-Line blocking, where the processing of one request is held up due to the loss of an earlier packet.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this comprehensive article, we built upon the foundation established in a <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp">previous piece</a> and delved into the internals of TCP communication. We began by examining the anatomy of IP packets and then explored the mechanisms involved in TCP connection creation and closure.</p>
<p>We discussed various mechanisms, such as Flow Control and Congestion Control, that TCP employs to ensure smooth and reliable data transmission. The sliding window technique facilitates controlled data flow, allowing for efficient management of both flow and congestion.</p>
<p>Additionally, we reviewed Nagle’s algorithm and the Delayed Acknowledgment Algorithm, both designed to reduce transmissions for improved efficiency. However, we noted that their combined use can lead to counterproductive outcomes. Finally, we addressed the concept of Head-of-Line blocking, illustrating its impact on data transmission.</p>
<p>I trust that this article has contributed to your understanding and provided valuable insights into the intricacies of TCP internals.</p>
]]></content:encoded></item><item><title><![CDATA[Apache Kafka: Architectural Overview and Performance Mechanisms]]></title><description><![CDATA[Introduction
Kafka is a distributed event store and stream processing platform originally developed by LinkedIn for real-time processing. In 2011, Kafka was transferred to the Apache Software Foundation, and since then, countless software development...]]></description><link>https://oxyprogrammer.com/apache-kafka-architectural-overview-and-performance-mechanisms</link><guid isPermaLink="true">https://oxyprogrammer.com/apache-kafka-architectural-overview-and-performance-mechanisms</guid><category><![CDATA[kafka]]></category><category><![CDATA[kafka topic]]></category><category><![CDATA[kafka producer]]></category><category><![CDATA[kafka consumers]]></category><category><![CDATA[kafka-partition]]></category><category><![CDATA[kafka broker]]></category><category><![CDATA[Kraft]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[leader election]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Sun, 24 Nov 2024 04:30:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731582744475/f23cc5b3-0f3e-41f3-9b46-47f12a7a9bca.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Kafka is a distributed event store and stream processing platform originally developed by LinkedIn for real-time processing. In 2011, Kafka was transferred to the Apache Software Foundation, and since then, countless software development teams have adopted it for various requirements. In this article, we will examine the various components of Kafka, how they function, and what enables Kafka to be highly available while handling millions of messages per second.</p>
<h1 id="heading-components-of-kafka">Components of Kafka</h1>
<h2 id="heading-topics">Topics</h2>
<p>A Kafka topic can be likened to a database table; while it is not strictly a table, this analogy helps illustrate its function. Each topic is identified by a unique name and does not perform data validation like a database table. The sequence of messages within a topic is referred to as a data stream. Topics are append-only: once a record is written, it cannot be modified, and it is removed only when its retention period expires (as described below).</p>
<p>Data in Kafka topics does not get removed after consumption, distinguishing Kafka from traditional message queues. Instead, the data is persisted in topics for a limited time, with a default expiry period of seven days that can be configured based on requirements. Topics are fed by producers, and the data within them is consumed by consumers, as we will explore shortly. </p>
<h2 id="heading-partitions-and-offsets">Partitions and Offsets</h2>
<p>One reason Kafka achieves high throughput is due to the distribution of topics across partitions. A single topic can consist of multiple partitions. When a message is published to a topic, a component of Kafka known as the producer (covered later) determines which partition will receive the message. Users can also instruct producers to route data to a specific partition by providing a partition key.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731580749267/6bb6bfcb-5902-4aa6-ba71-fd9dddfe4e34.png" alt class="image--center mx-auto" /></p>
<p>Within each partition, messages are assigned an incremental identifier known as an offset. The offset is meaningful only within a specific partition; thus, offset 2 in partition 1 is unrelated to offset 2 in partition 2. This design ensures that order is guaranteed only within partitions, not across the entire topic. Additionally, offsets will not be reused, even after data has been deleted. </p>
<h2 id="heading-producers">Producers</h2>
<p>A producer is a client that writes data to Kafka topics, which are composed of partitions. Producers have the following key attributes: </p>
<ul>
<li><p>They perform load balancing across partitions of a topic until a specific partition key is provided.</p>
</li>
<li><p>If a broker hosting the intended partition goes down while the producer is pushing messages, the producer can recover by shifting to a replicated partition. We will delve deeper into replication in Kafka later in this article.</p>
</li>
<li><p>Producers know which broker (a physical machine where a topic partition resides) contains the necessary partition for writing data. A producer employs partitioner logic to determine which partition to use for a record. The default logic hashes the record key's bytes with the Murmur2 algorithm and maps the (sign-cleared) result onto a partition:<br />  <code>targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions</code></p>
</li>
</ul>
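<p>For illustration, here is a Go sketch of that default-partitioner logic. The <code>murmur2</code> function below mirrors the 32-bit MurmurHash2 variant commonly documented for Kafka (seed <code>0x9747b28c</code>), and the key is a made-up example; treat it as an approximation for learning purposes rather than a drop-in replacement for the real Java implementation.</p>

```go
package main

import "fmt"

// murmur2 is a sketch of the 32-bit MurmurHash2 variant used by Kafka's
// default partitioner (seed 0x9747b28c); included here for illustration.
func murmur2(data []byte) uint32 {
	const m = 0x5bd1e995
	const r = 24
	h := uint32(0x9747b28c) ^ uint32(len(data))
	i := 0
	for ; i+4 <= len(data); i += 4 {
		k := uint32(data[i]) | uint32(data[i+1])<<8 |
			uint32(data[i+2])<<16 | uint32(data[i+3])<<24
		k *= m
		k ^= k >> r
		k *= m
		h *= m
		h ^= k
	}
	// Mix in the trailing 1-3 bytes, if any.
	switch len(data) - i {
	case 3:
		h ^= uint32(data[i+2]) << 16
		fallthrough
	case 2:
		h ^= uint32(data[i+1]) << 8
		fallthrough
	case 1:
		h ^= uint32(data[i])
		h *= m
	}
	h ^= h >> 13
	h *= m
	h ^= h >> 15
	return h
}

// targetPartition clears the sign bit and takes the hash modulo the
// partition count, so the same key always lands on the same partition.
func targetPartition(key []byte, numPartitions int) int {
	return int(murmur2(key)&0x7fffffff) % numPartitions
}

func main() {
	key := []byte("truck-42") // hypothetical partition key
	fmt.Printf("key %q -> partition %d of 6\n", key, targetPartition(key, 6))
}
```

<p>The important property is determinism: every record with the same key hashes to the same partition, which is what preserves per-key ordering in Kafka.</p>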
<p>A Kafka message sent to the broker comprises the following components: </p>
<ul>
<li><p><strong>Key</strong>: In binary format and can be null.</p>
</li>
<li><p><strong>Value</strong>: The message itself, also in binary format and nullable.</p>
</li>
<li><p><strong>Compression Type</strong>: The supported compression types include <em>none</em>, <em>gzip</em>, <em>snappy</em>, <em>lz4</em>, and <em>zstd</em>.</p>
</li>
<li><p><strong>Headers</strong>: In key-value format, these are optional.</p>
</li>
<li><p><strong>Partition</strong>: Contains information about the partition where the message will be written.</p>
</li>
<li><p><strong>Timestamp</strong>: This can be either user-defined or system-generated.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731580800043/5e3bfb01-9275-4706-9567-042dca32f178.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-kafka-message-serializer">Kafka Message Serializer</h3>
<p>Kafka topic partitions exclusively accept bytes from producers and return bytes to consumers (which we will discuss shortly). The key and value in each message are serialized before being pushed into the topics.</p>
<h4 id="heading-common-serializers-provided-by-kafka">Common Serializers provided by Kafka:</h4>
<ul>
<li><p>String (including JSON)</p>
</li>
<li><p>Int, Float</p>
</li>
<li><p>Avro</p>
</li>
<li><p>Protobuf</p>
</li>
</ul>
<h1 id="heading-consumers">Consumers</h1>
<p>Consumers are clients that perform the opposite function of producers: they <strong>pull</strong> data from a topic. Data is read sequentially from low to high within each partition.</p>
<h2 id="heading-consumer-deserializer">Consumer Deserializer</h2>
<p>Just as producers have serializers, consumers utilize deserializers. Common deserializers include: </p>
<ul>
<li><p>String (including JSON)</p>
</li>
<li><p>Int, Float</p>
</li>
<li><p>Avro</p>
</li>
<li><p>Protobuf</p>
</li>
</ul>
<p>It is important to note that a Kafka topic cannot change its serialization or deserialization type throughout its lifecycle. The same serialization format must be used for both serialization and deserialization processes.</p>
<h2 id="heading-consumer-groups">Consumer Groups</h2>
<p>All consumers within an application read data as part of consumer groups. </p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Each consumer within a group can read from multiple partitions, but a single partition cannot be consumed by more than one consumer at a time. Refer to the diagram below.</strong></div>
</div>

<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731580949407/6ee16286-f143-490a-83ba-81d386ced393.png" alt class="image--center mx-auto" /></p>
<p><strong>A natural question arises</strong>: What happens if the number of consumers in a consumer group exceeds the number of available partitions? </p>
<p>In this case, the additional consumers will exist but remain inactive, as illustrated in the diagram below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731580971547/16764ebc-b977-427f-9b97-e5dc4d8c3af7.png" alt class="image--center mx-auto" /></p>
<p>A topic can be connected to multiple consumer groups without any issue.</p>
<p>To create distinct consumer groups, the Kafka consumer framework provides a property called <code>group.id</code>. For example, you might have a notification service and a dashboard service that both listen to a topic called <code>truck_gps_location</code>. While the dashboard service updates a geographic map on the dashboard, the notification service is responsible for raising alerts for interested users via email or text. These two services belong to different consumer groups.</p>
<h2 id="heading-consumer-offsets">Consumer Offsets</h2>
<p>Kafka stores the read offsets for each consumer group in an internal topic called <code>__consumer_offsets</code>. This mechanism ensures that if a consumer in a group fails, Kafka allows it to resume data retrieval from the last position it accessed. </p>
<p>Consumers need to periodically commit the read offsets in Kafka, which can be done either automatically or manually. Depending on when offsets are committed relative to message processing, there are three delivery semantics to consider: </p>
<ul>
<li><p><strong>At Least Once (usually preferred)</strong>: Offsets are committed after the message is processed. If an error occurs during processing, the message will be read again, which might lead to duplicate processing. Therefore, consumers should handle messages idempotently.</p>
</li>
<li><p><strong>At Most Once</strong>: Offsets are committed immediately upon receiving the messages. If an error occurs during processing, some messages may be lost and will not be read again.</p>
</li>
<li><p><strong>Exactly Once</strong>: This approach is recommended when there is a need to read from a topic and then write back to it. The Transactional API of Kafka can be used for this purpose.</p>
</li>
</ul>
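<p>The difference between the first two semantics is purely <em>when</em> the offset is committed relative to processing. A toy in-memory simulation makes this concrete (all names here are illustrative; a real consumer commits offsets back to Kafka, not to a local counter):</p>

```typescript
type Handler = (msg: string) => void;

// Toy consumer: reads messages from an array while tracking a committed offset.
class ToyConsumer {
  committed = 0;
  constructor(private log: string[]) {}

  // At-least-once: commit only AFTER processing succeeds.
  pollAtLeastOnce(handle: Handler): void {
    const msg = this.log[this.committed];
    if (msg === undefined) return;
    handle(msg);          // may throw -> offset not committed, msg re-read later
    this.committed++;     // commit after successful processing
  }

  // At-most-once: commit BEFORE processing.
  pollAtMostOnce(handle: Handler): void {
    const msg = this.log[this.committed];
    if (msg === undefined) return;
    this.committed++;     // commit first -> a crash below loses the message
    handle(msg);
  }
}

const consumer = new ToyConsumer(["m1", "m2"]);
try {
  consumer.pollAtLeastOnce(() => { throw new Error("processing failed"); });
} catch {}
console.log(consumer.committed); // still 0: "m1" will be delivered again
```

<p>This is exactly why at-least-once consumers must be idempotent: the same message can be delivered more than once after a failure.</p>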
<h2 id="heading-kafka-brokers-and-topic-replication-factor">Kafka Brokers and Topic Replication Factor</h2>
<p>Kafka is a distributed software platform that spans multiple nodes or servers, referred to as brokers. The partitions for a topic are distributed across different brokers to facilitate horizontal scaling, and they are also replicated to ensure high availability. The <strong>replication factor</strong> of a topic determines how many replicas of each partition will be stored.</p>
<p>For example, consider a Kafka cluster with three brokers and a replication factor of 2.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731581026226/a0660232-d69e-48b9-9179-fca58744db93.png" alt class="image--center mx-auto" /></p>
<p>As illustrated in the above diagram, the partitions for both topics are distributed across the three brokers, with each partition's replication also stored across the brokers. This design allows Kafka topics to be both distributed and highly available.</p>
<p>When a client connects to any broker within the Kafka cluster, the broker acts as a bootstrap broker. It provides the client with information about all the other brokers, their addresses, and the partitions stored on them, allowing the client to know which broker to connect to for the specific partition it requires.</p>
<p>Regarding the replicas of a partition, only one copy—the leader replica—will accept writes from producers. Producers send messages only to the leader partition. Consumers, on the other hand, read from the closest replica of the partition.</p>
<h2 id="heading-zookeeper-kraft">ZooKeeper / KRaft</h2>
<p><code>ZooKeeper</code>, a distributed system in its own right, was initially used to coordinate the brokers in a Kafka cluster, enabling them to manage events such as new topic creation, leader elections across partitions, and broker failures (both the death of a broker and its recovery). However, <code>ZooKeeper</code> faced scaling issues beyond roughly 100,000 partitions. Consequently, it has been deprecated and, as of Kafka 4.0, removed entirely in favor of <code>KRaft</code>, Kafka's own implementation of the Raft consensus protocol, which allows a cluster to scale to millions of partitions.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we examined the various components of Apache Kafka and gained an understanding of how they function. At a high level, we explored the factors that make Kafka both scalable and highly available. We began with topics and delved into components such as partitions. We also discussed producers, consumers, and consumer groups, highlighting the anatomy of a Kafka message sent by the producer to the partitions and how consumer groups allow different services to read from the same topic for various business purposes.</p>
<p>Kafka is continually being enhanced for performance improvements and remains a highly sought-after technology for high-speed event processing. I hope this article has clarified many of your questions about Kafka and has instilled confidence in you to consider it for your next project.</p>
]]></content:encoded></item><item><title><![CDATA[Demystifying DNS: Understanding Domain Name Resolution]]></title><description><![CDATA[Introduction
The route to reach a server hosted on the public internet is through its public IP address. If you know the IP address of a server, you can access it directly. However, there are two main drawbacks to this approach:

Hard to Operate: Rem...]]></description><link>https://oxyprogrammer.com/demystifying-dns-understanding-domain-name-resolution</link><guid isPermaLink="true">https://oxyprogrammer.com/demystifying-dns-understanding-domain-name-resolution</guid><category><![CDATA[networking]]></category><category><![CDATA[dns]]></category><category><![CDATA[dns resolver]]></category><category><![CDATA[#TLD]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Sat, 23 Nov 2024 04:30:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731218576888/aa7c722f-06a6-4c96-8703-bf7bebb52fd6.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>The route to reach a server hosted on the public internet is through its public IP address. If you know the IP address of a server, you can access it directly. However, there are two main drawbacks to this approach:</p>
<ol>
<li><p><strong>Hard to Operate</strong>: IP addresses are difficult for people to remember, and they are subject to change. It’s much easier to remember google.com than <code>142.250.182.14</code>.</p>
</li>
<li><p><strong>Not Scalable</strong>: A high-demand service like Google runs hundreds of servers around the globe. The server that answers a user in Australia is not the one that answers a user in North America, so there is no single IP address to hand out.</p>
</li>
</ol>
<p>To address these problems, the Domain Name System (DNS) is utilized. In this article, we will take a detailed look at how DNS resolution occurs under the hood.</p>
<h1 id="heading-dns-resolution-under-the-hood">DNS Resolution Under the hood</h1>
<p>DNS is a protocol built on top of UDP, a layer 4 protocol. You may refer to the <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp">first article in this series</a> for a more in-depth understanding of network layers. The preferred port for DNS is 53 (similar to port 80 for HTTP).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731218610142/937dc599-e4cd-469d-b658-8639a220afd3.png" alt class="image--center mx-auto" /></p>
<p>The process begins when you type <a target="_blank" href="http://google.com">google.com</a> into your browser's address bar. The following steps will occur (refer diagram above):</p>
<ol>
<li><p><strong>Hit the Resolver</strong>: If the browser does not have the IP address of <code>google.com</code> in its local cache, it sends a synchronous request to the resolver server for the IP address of <code>google.com</code>. Resolver server addresses are configured within your network: once you connect to a network with internet access, your machine learns the resolver address, typically from your ISP.</p>
</li>
<li><p><strong>Resolver Hits Root</strong>: The resolver then queries a DNS root server. There are 13 logical root servers, each replicated across many locations globally, ensuring high availability and low latency. The root server responds with the address of the relevant top-level domain (TLD) server, such as the one for <code>.com</code>.</p>
</li>
<li><p><strong>Resolver Hits the TLD</strong>: Next, the resolver contacts the TLD server, in this case the <code>.com</code> server, which holds records for every domain registered under <code>.com</code>. Like the root servers, TLD servers are replicated worldwide. The TLD server returns the address of an authoritative name server.</p>
</li>
<li><p><strong>Authoritative Name Server Response</strong>: Finally, the authoritative name server returns the IP address of the requested domain, typically that of the replica closest to the user. This is also where load balancing occurs: for example, <code>google.com</code> returns different IP addresses of replicated servers in response to different requests to distribute the load effectively.</p>
</li>
</ol>
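<p>The chain of referrals above can be sketched as a tiny in-memory resolver (the server data is fabricated for illustration; a real resolver speaks the DNS wire protocol over UDP):</p>

```typescript
// Fabricated referral data: each level tells the resolver whom to ask next.
const root = new Map([["com", "tld-com-server"]]);          // root -> TLD server
const tld = new Map([["google.com", "ns1.google"]]);        // TLD -> authoritative
const authoritative = new Map([["google.com", "142.250.182.14"]]);

const cache = new Map<string, string>(); // resolver-side cache

function resolve(domain: string): string {
  const hit = cache.get(domain);
  if (hit) return hit;                       // cached: skip root/TLD entirely
  const tldName = domain.split(".").pop()!;  // e.g. "com"
  const tldServer = root.get(tldName);       // step 2: ask a root server
  if (!tldServer) throw new Error(`no TLD server for .${tldName}`);
  const authServer = tld.get(domain);        // step 3: ask the TLD server
  if (!authServer) throw new Error(`unknown domain ${domain}`);
  const ip = authoritative.get(domain)!;     // step 4: authoritative answer
  cache.set(domain, ip);
  return ip;
}

console.log(resolve("google.com")); // first lookup walks the full chain
console.log(resolve("google.com")); // second lookup is served from cache
```

<p>The cache in the sketch mirrors the real resolver's behavior of answering repeat queries without contacting the root again.</p>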
<h1 id="heading-dns-packet-header-anatomy">DNS packet header anatomy</h1>
<p>A question that arises is: how does the resolver distinguish between the different requests that are coming in?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731218650674/8ed4ee2b-33f2-46e7-a9ba-f2fbb95267ee.png" alt class="image--center mx-auto" /></p>
<p>Examining the DNS request header helps answer this question. A DNS packet header contains a <mark>Transaction ID</mark>, which lets the resolver match each response to its originating request. It is also worth noting that the resolver maintains a local cache to avoid contacting the root server for every request.</p>
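<p>The header itself is only 12 bytes, so it is easy to construct by hand. Here is a minimal sketch that writes the Transaction ID into a query header, following the field layout of RFC 1035 (this builds just the header, not a complete packet):</p>

```typescript
// Build the 12-byte DNS header described in RFC 1035: ID, flags, four counts.
function buildDnsHeader(transactionId: number): Uint8Array {
  const header = new Uint8Array(12);
  const view = new DataView(header.buffer);
  view.setUint16(0, transactionId); // Transaction ID: matches replies to requests
  view.setUint16(2, 0x0100);        // flags: standard query, recursion desired (RD)
  view.setUint16(4, 1);             // QDCOUNT: one question follows
  // ANCOUNT, NSCOUNT, ARCOUNT remain 0 in a query
  return header;
}

const h = buildDnsHeader(0xabcd);
console.log(h[0].toString(16), h[1].toString(16)); // big-endian ID bytes
```

<p>When the response arrives, the resolver reads the same two leading bytes back and matches them against its table of in-flight queries.</p>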
<h1 id="heading-further-reading">Further Reading</h1>
<ul>
<li><p><a target="_blank" href="https://www.usenix.org/system/files/sec20-zheng.pdf">https://www.usenix.org/system/files/sec20-zheng.pdf</a></p>
</li>
<li><p><a target="_blank" href="https://datatracker.ietf.org/doc/html/rfc1035">https://datatracker.ietf.org/doc/html/rfc1035</a> </p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>It is fascinating to realize how many processes occur behind the scenes when we attempt to access a domain name through our browser. We explored the various steps that constitute the DNS resolution process. We also noted that the DNS request is synchronous and built on top of UDP, ensuring fast communication.</p>
]]></content:encoded></item><item><title><![CDATA[Streamlining Pagination in TypeScript: An Efficient Paginator Class]]></title><description><![CDATA[Introduction
Pagination is a crucial aspect of frontend engineering, and no frontend developer can claim to have avoided the need for it. There are standard methods for implementing pagination, and they are well understood. In this article, I present...]]></description><link>https://oxyprogrammer.com/streamlining-pagination-in-typescript-an-efficient-paginator-class</link><guid isPermaLink="true">https://oxyprogrammer.com/streamlining-pagination-in-typescript-an-efficient-paginator-class</guid><category><![CDATA[TypeScript]]></category><category><![CDATA[JavaScript]]></category><category><![CDATA[Pagination]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[React]]></category><category><![CDATA[caching]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Fri, 22 Nov 2024 04:30:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731136050989/98f86da7-6148-4c70-a9c2-f36755b70220.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Pagination is a crucial aspect of frontend engineering, and no frontend developer can claim to have avoided the need for it. There are standard, well-understood methods for implementing pagination. In this article, I present a paginator designed to abstract pagination into a separate layer, thereby keeping the view component code clean. Additionally, it utilizes caching to enable fast navigation through pages.</p>
<h1 id="heading-about-the-paginator">About the paginator</h1>
<p>You can find the entire code, along with a demo page built on the paginator, at:</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">👉</div>
<div data-node-type="callout-text"><a target="_self" href="https://github.com/OxyProgrammer/efficient-paginator">https://github.com/OxyProgrammer/efficient-paginator</a></div>
</div>

<h1 id="heading-peeking-into-the-code">Peeking into the Code</h1>
<p>The paginator is a generic TypeScript class that provides on-demand iteration.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-keyword">class</span> EfficientPaginator&lt;T&gt; {
  <span class="hljs-keyword">private</span> currentPage: <span class="hljs-built_in">number</span>;
  <span class="hljs-keyword">private</span> pageSize: <span class="hljs-built_in">number</span>;
  <span class="hljs-keyword">private</span> hasMore: <span class="hljs-built_in">boolean</span>;
  <span class="hljs-keyword">private</span> cache: <span class="hljs-built_in">Map</span>&lt;<span class="hljs-built_in">number</span>, FetchResponse&lt;T&gt;&gt;;
  <span class="hljs-keyword">private</span> fetchFunction: FetchFunction&lt;T&gt;;

  <span class="hljs-keyword">constructor</span>(<span class="hljs-params">pageSize: <span class="hljs-built_in">number</span>, fetchFunction: FetchFunction&lt;T&gt;</span>) {
    <span class="hljs-built_in">this</span>.currentPage = <span class="hljs-number">0</span>;
    <span class="hljs-built_in">this</span>.pageSize = pageSize;
    <span class="hljs-built_in">this</span>.hasMore = <span class="hljs-literal">true</span>;
    <span class="hljs-built_in">this</span>.cache = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>();
    <span class="hljs-built_in">this</span>.fetchFunction = fetchFunction; <span class="hljs-comment">// Assign the fetch function</span>
  }

  <span class="hljs-keyword">async</span> getItems(direction: Direction): <span class="hljs-built_in">Promise</span>&lt;T[]&gt; {
    <span class="hljs-keyword">if</span> (direction === Direction.Next) {
      <span class="hljs-built_in">this</span>.currentPage++;
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (direction === Direction.Previous &amp;&amp; <span class="hljs-built_in">this</span>.currentPage &gt; <span class="hljs-number">1</span>) {
      <span class="hljs-built_in">this</span>.currentPage--;
    }


    <span class="hljs-keyword">let</span> cacheEntry = <span class="hljs-built_in">this</span>.cache.get(<span class="hljs-built_in">this</span>.currentPage);
    <span class="hljs-keyword">if</span> (cacheEntry) {
      <span class="hljs-built_in">this</span>.hasMore = cacheEntry.hasMore;
      <span class="hljs-keyword">return</span> cacheEntry.users;
    }

    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.fetchFunction(
        <span class="hljs-built_in">this</span>.currentPage,
        <span class="hljs-built_in">this</span>.pageSize
      );
      <span class="hljs-built_in">this</span>.cache.set(<span class="hljs-built_in">this</span>.currentPage, response);
      <span class="hljs-built_in">this</span>.hasMore = response.hasMore;
      <span class="hljs-keyword">return</span> response.users;
    } <span class="hljs-keyword">catch</span> (error) {
      <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error fetching items:'</span>, error);
      <span class="hljs-keyword">return</span> [];
    }
  }

  hasPrevious(): <span class="hljs-built_in">boolean</span> {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.currentPage &gt; <span class="hljs-number">1</span>;
  }

  hasNext(): <span class="hljs-built_in">boolean</span> {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.hasMore;
  }

  getCurrentPage(): <span class="hljs-built_in">number</span> {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.currentPage;
  }

  getPageSize(): <span class="hljs-built_in">number</span> {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.pageSize;
  }
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">interface</span> FetchFunction&lt;T&gt; {
  (page: <span class="hljs-built_in">number</span>, pageSize: <span class="hljs-built_in">number</span>): <span class="hljs-built_in">Promise</span>&lt;FetchResponse&lt;T&gt;&gt;;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">interface</span> FetchResponse&lt;T&gt; {
  users: T[];
  hasMore: <span class="hljs-built_in">boolean</span>;
}
</code></pre>
<ul>
<li><p>An instance of <code>EfficientPaginator</code> is initialized with the page size and a <code>fetchFunction</code>. </p>
</li>
<li><p>The page size remains constant throughout the lifetime of this instance, as does the fetch function. </p>
</li>
<li><p>Caching is implemented with a map keyed by page number, storing the response returned by the fetch function for that page. </p>
</li>
<li><p>The paginator exposes <code>hasPrevious</code> and <code>hasNext</code> methods, enabling the calling code to know immediately whether a previous or next page is available.</p>
</li>
<li><p>The <code>currentPage</code> and <code>pageSize</code> are also exposed to assist the calling function in providing visual aids to the user. </p>
</li>
<li><p>The paginator enforces a specific signature for the <code>fetchFunction</code> (<code>FetchFunction&lt;T&gt;</code>) and the schema of the response (<code>FetchResponse&lt;T&gt;</code>) that it receives from the function.  </p>
</li>
</ul>
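<p>To make the contract concrete, here is one way a conforming <code>fetchFunction</code> might look against an in-memory dataset (illustrative only; a real implementation would call your backend API, and the interface is repeated from the listing above):</p>

```typescript
interface FetchResponse<T> { users: T[]; hasMore: boolean; }

// Illustrative in-memory dataset; a real fetchFunction would query a server.
const allUsers = Array.from({ length: 23 }, (_, i) => `user-${i + 1}`);

// Conforms to FetchFunction<string>: (page, pageSize) => Promise<FetchResponse<string>>
async function fetchUsers(page: number, pageSize: number): Promise<FetchResponse<string>> {
  const start = (page - 1) * pageSize; // the paginator's pages are 1-based
  return {
    users: allUsers.slice(start, start + pageSize),
    hasMore: start + pageSize < allUsers.length,
  };
}

fetchUsers(3, 10).then((r) => console.log(r.users.length, r.hasMore));
```

<p>An instance would then be created as <code>new EfficientPaginator(10, fetchUsers)</code>, and the view component never needs to know how pages are fetched or cached.</p>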
<h2 id="heading-caveats">Caveats</h2>
<p>There are a few caveats with this technique, but they can be easily managed by making client-side caching optional. </p>
<ul>
<li><p>The returned objects are cached on the UI side, and if the number becomes too large, the browser may experience slowdowns.</p>
</li>
<li><p>Additionally, if client-side caching is enabled, it may display stale data that has since changed on the server. Creating a new <code>Paginator</code> instance (for example, when the page size changes) starts afresh, ensuring that the latest data is shown.</p>
</li>
</ul>
<h1 id="heading-improvements">Improvements</h1>
<p>The intention of the code and demo is to illustrate the concept of abstracting pagination logic rather than creating a full-fledged component for reuse. Consequently, the paginator presented in the demo is not perfect. There are opportunities for improvement, and I welcome anyone interested to submit a pull request addressing the following issues: </p>
<ul>
<li><p>Exposing the current page and total pages through the paginator. This would require the <code>FetchResponse</code> interface to be modified to include the total page count returned by the server.</p>
</li>
<li><p>Making client-side caching optional to address the concerns mentioned in the caveats.</p>
</li>
<li><p>Currently, if there is an issue calling the fetch function, the paginator logs the error and returns an empty array. It could be improved by propagating a valid exception to the caller.</p>
</li>
</ul>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we explored a TypeScript paginator class that implements client-side caching to provide a low-latency experience for users, while encapsulating pagination logic away from the view components.</p>
]]></content:encoded></item><item><title><![CDATA[Navigating the Dual Write Problem: Implementing the Outbox Pattern for Data Consistency in Distributed Systems]]></title><description><![CDATA[Introduction
Maintaining data consistency in distributed systems is a challenge due to the possibility of service or component failures at any time.
In a previous article, I provided a detailed explanation of CQRS with Event Sourcing, discussing thes...]]></description><link>https://oxyprogrammer.com/navigating-the-dual-write-problem-implementing-the-outbox-pattern-for-data-consistency-in-distributed-systems</link><guid isPermaLink="true">https://oxyprogrammer.com/navigating-the-dual-write-problem-implementing-the-outbox-pattern-for-data-consistency-in-distributed-systems</guid><category><![CDATA[Microservice Architecture]]></category><category><![CDATA[dualwrite]]></category><category><![CDATA[outbox pattern]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Thu, 21 Nov 2024 04:30:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731233142968/43d88527-b070-4f47-b271-06f607a93038.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Maintaining data consistency in distributed systems is a challenge due to the possibility of service or component failures at any time.</p>
<p>In a previous article, I provided a detailed explanation of CQRS with Event Sourcing, discussing these two patterns in principle and elaborating on their implementation using .NET. A caveat that was left unanswered in that article was the possibility of failure when publishing an event after writing to the event store.</p>
<p>In this article, we will explore the classic distributed system problem known as the Dual Write Problem and a possible solution to it.</p>
<h1 id="heading-dual-write-problem">Dual Write Problem</h1>
<p>Most of us are familiar with database transactions, where all Data Manipulation Language (DML) statements executed within the scope of a transaction are either fully committed or entirely reverted.</p>
<p>Now, imagine achieving a similar result across two entirely different systems. Consider the common scenario in which a microservice needs to update its database and then publish an event to an event bus (such as Kafka or an MQ). Refer to the diagram below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731233168524/04edbb41-c6e1-4067-89cb-e641fef96b51.png" alt class="image--center mx-auto" /></p>
<p>Considering what can go wrong:</p>
<ol>
<li><p>It’s a positive outcome if both transactions are successful.</p>
</li>
<li><p>If both transactions fail, the system also remains consistent, which is still acceptable from a data consistency perspective.</p>
</li>
<li><p>If the database transaction succeeds but an issue arises before event publishing, leaving the event unpublished, this leads to an inconsistent state in our system.</p>
</li>
<li><p>Reversing the order of these operations (publishing the event first, then writing to the database) only changes which side can be left inconsistent; the underlying issue remains.</p>
</li>
</ol>
<p>If the event fails to be placed onto the event bus due to its unavailability or a network issue, the event will be lost, resulting in inconsistent data.</p>
<p>Traditional databases address transaction problems through transaction logs, checkpointing, etc. However, in this case, we are dealing with two distinct systems.</p>
<h1 id="heading-the-solution">The solution</h1>
<p>The solution is to execute these transactions in series: the data should progress from its original state through an intermediate state to the desired distributed state.</p>
<p>Here is a diagram to help you visualize the various scenarios:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731233208667/10adab08-c5ac-43d5-99e3-1afa3f38b5db.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p>The happy case occurs when the data flows to the database and then to the event bus.</p>
</li>
<li><p>In another happy scenario, data does not flow to the database (due to the database being down or some network issue), and the event bus also does not receive any events. The data state remains consistent.</p>
</li>
<li><p>The case where data writes are successful to the database but fail when attempting to publish to the event bus (due to network issues or an unavailable event bus) creates an intermediate state. Here, the intermediate data (I.D.) is recorded in the database. Whenever the event bus becomes available, we can replay the intermediate data, ensuring that the system reaches a consistent state.</p>
</li>
</ol>
<h1 id="heading-outbox-pattern">Outbox Pattern</h1>
<p>The Outbox Pattern (also known as the Transactional Outbox Pattern) is an implementation of the solution described above.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">It operates similarly to an email outbox. When an email is sent, it is first stored in the outbox, and once the email is successfully sent, the message is removed and placed in the sent items. Hence the name!</div>
</div>

<p>Refer to the diagram below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731233295485/e819d713-0c56-45d9-9f46-4ebd4d29f697.png" alt class="image--center mx-auto" /></p>
<p>The service writes data to the database and also records the event in an outbox table or collection. A worker process continuously monitors the outbox. As soon as an entry is made, it attempts to push it to the event bus and subsequently removes that entry from the outbox.</p>
<p>If the worker cannot place an entry onto the event bus due to issues like downtime or network problems, the event item will remain in the outbox and will be cleared once the worker successfully pushes it to the event bus.</p>
<p>A few points worth considering are:</p>
<ul>
<li><p>The worker process may go down after publishing to the event bus but before removing the event item from the outbox. Therefore, any consuming service (not shown in the diagram) must be <a target="_blank" href="https://en.wikipedia.org/wiki/Idempotence">idempotent</a>, as they may receive the same event multiple times.</p>
</li>
<li><p>In some implementations, the task of the worker process is handled by the service itself.</p>
</li>
</ul>
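<p>The pattern can be sketched with in-memory stand-ins for the database, outbox table, and event bus (all names are illustrative; in production the business write and the outbox insert must share a single database transaction):</p>

```typescript
type OutboxEvent = { id: number; payload: string };

const database: string[] = [];      // business data
const outbox: OutboxEvent[] = [];   // outbox table (same DB, same transaction)
const eventBus: OutboxEvent[] = []; // stand-in for Kafka / MQ

let nextId = 1;

// One atomic "transaction": write the business data AND the outbox entry together.
function saveOrder(order: string): void {
  database.push(order);
  outbox.push({ id: nextId++, payload: order });
}

// Worker: drain the outbox onto the event bus; on failure, entries remain queued.
function runOutboxWorker(busAvailable: boolean): void {
  while (outbox.length > 0) {
    if (!busAvailable) return;      // bus down: retry on the next worker run
    const event = outbox[0];
    eventBus.push(event);           // publish first...
    outbox.shift();                 // ...then remove from the outbox
  }
}

saveOrder("order-1");
runOutboxWorker(false); // bus unavailable: event stays safely in the outbox
runOutboxWorker(true);  // bus back up: event is published and cleared
console.log(eventBus.length, outbox.length); // 1 0
```

<p>Note that the worker publishes before it deletes, so a crash between those two steps re-publishes the event on the next run: at-least-once delivery, which is why consumers must be idempotent.</p>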
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we examined the Dual Write Problem in distributed systems and explored a corresponding solution. We determined that the Transactional Outbox Pattern is a viable solution, guaranteeing at-least-once delivery to the event bus and therefore requiring consumer services to be idempotent. This pattern enhances the consistency of our system's data.</p>
]]></content:encoded></item><item><title><![CDATA[Efficient Data Handling: Understanding JavaScript Iterators and Generators]]></title><description><![CDATA[Introduction
Looping is one of the essential features of any programming language, and JavaScript is no exception. Coming from a C# background, I am familiar with the IEnumerable interface, which allows a class to be iterable. JavaScript provides a s...]]></description><link>https://oxyprogrammer.com/efficient-data-handling-understanding-javascript-iterators-and-generators</link><guid isPermaLink="true">https://oxyprogrammer.com/efficient-data-handling-understanding-javascript-iterators-and-generators</guid><category><![CDATA[JavaScript]]></category><category><![CDATA[javascript framework]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Wed, 20 Nov 2024 04:30:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731136141131/01da352a-afbd-4967-878f-c9e49413e574.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Looping is one of the essential features of any programming language, and JavaScript is no exception. Coming from a C# background, I am familiar with the <code>IEnumerable</code> interface, which allows a class to be iterable. JavaScript provides a similar feature called an iterable, and when combined with generator functions, it offers powerful iteration capabilities.</p>
<h1 id="heading-background">Background</h1>
<p>JavaScript supports various looping mechanisms, including <code>for...in</code> and <code>for...of</code>, in addition to the standard for loop present in almost every programming language.</p>
<p>The <code>for...in</code> loop enables us to iterate through the keys of an object:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> obj = { <span class="hljs-attr">a</span>: <span class="hljs-number">1</span>, <span class="hljs-attr">b</span>: <span class="hljs-number">2</span>, <span class="hljs-attr">c</span>: <span class="hljs-number">3</span> };
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> prop <span class="hljs-keyword">in</span> obj) {
    <span class="hljs-built_in">console</span>.log(prop + <span class="hljs-string">': '</span> + obj[prop]);
}
</code></pre>
<p>On the other hand, the <code>for...of</code> loop lets us iterate through a collection of objects:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> arr = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>];
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> value <span class="hljs-keyword">of</span> arr) {
    <span class="hljs-built_in">console</span>.log(value);
}
</code></pre>
<p>This led me to wonder what allows an array to work within a <code>for...of</code> loop. The answer lies in <code>Symbol.iterator</code>. When combined with lazily evaluated generator functions, it lets us build powerful encapsulations over complex iteration logic. Let's explore this further in the following sections.</p>
<h1 id="heading-iterator">Iterator</h1>
<p><code>Symbol.iterator</code> is a special function that, when defined on an object or its prototype, grants looping capabilities to that object; in other words, we can use a <code>for...of</code> loop to iterate over it. Here’s a sample code snippet to help you visualize this:</p>
<pre><code class="lang-javascript"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">FirstNNumbers</span> </span>{
    <span class="hljs-keyword">constructor</span>(N) {
        <span class="hljs-built_in">this</span>.N = N;
    }

    [<span class="hljs-built_in">Symbol</span>.iterator]() {
       <span class="hljs-keyword">let</span> count = <span class="hljs-number">0</span>;
       <span class="hljs-keyword">const</span> limit = <span class="hljs-built_in">this</span>.N;

       <span class="hljs-keyword">return</span> {
           <span class="hljs-attr">next</span>: <span class="hljs-function">() =&gt;</span> {
               count++;
               <span class="hljs-keyword">if</span> (count &lt;= limit) {
                   <span class="hljs-keyword">return</span> { <span class="hljs-attr">value</span>: count, <span class="hljs-attr">done</span>: <span class="hljs-literal">false</span> };
               } <span class="hljs-keyword">else</span> {
                   <span class="hljs-keyword">return</span> { <span class="hljs-attr">done</span>: <span class="hljs-literal">true</span> };
               }
            }
        };
    }
}

<span class="hljs-comment">// Example Usage</span>
<span class="hljs-keyword">const</span> firstFiveNumbers = <span class="hljs-keyword">new</span> FirstNNumbers(<span class="hljs-number">5</span>);
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> num <span class="hljs-keyword">of</span> firstFiveNumbers) {
    <span class="hljs-built_in">console</span>.log(num); <span class="hljs-comment">// Outputs: 1, 2, 3, 4, 5</span>
}
</code></pre>
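<p>Because <code>for...of</code> consumes the standard iteration protocol, any object implementing <code>Symbol.iterator</code> also works with spread syntax and destructuring. A small self-contained sketch (my own example, separate from the class above):</p>

```javascript
// Any object implementing Symbol.iterator works with spread syntax
// and destructuring, not just for...of. Minimal countdown iterable:
const countdown = {
  from: 3,
  [Symbol.iterator]() {
    let current = this.from;
    return {
      next: () =>
        current > 0
          ? { value: current--, done: false }
          : { value: undefined, done: true },
    };
  },
};

console.log([...countdown]);        // [3, 2, 1]
const [first, second] = countdown;  // destructuring invokes the iterator again
console.log(first, second);         // 3 2
```

<p>Each consumer calls <code>[Symbol.iterator]()</code> afresh, which is why the destructuring above starts from 3 again rather than continuing where the spread left off.</p>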
<p>The example above demonstrates a JavaScript class (which I prefer). However, if you favor a functional programming style, you can achieve the same functionality with functional code:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> FirstNNumbers = <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">N</span>) </span>{
    <span class="hljs-built_in">this</span>.N = N;
};


FirstNNumbers.prototype[<span class="hljs-built_in">Symbol</span>.iterator] = <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">let</span> count = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">const</span> limit = <span class="hljs-built_in">this</span>.N;

    <span class="hljs-keyword">return</span> {
        <span class="hljs-attr">next</span>: <span class="hljs-function">() =&gt;</span> {
            <span class="hljs-keyword">if</span> (count &lt; limit) {
                count++;
                <span class="hljs-keyword">return</span> { <span class="hljs-attr">value</span>: count, <span class="hljs-attr">done</span>: <span class="hljs-literal">false</span> };
            } <span class="hljs-keyword">else</span> {
                <span class="hljs-keyword">return</span> { <span class="hljs-attr">done</span>: <span class="hljs-literal">true</span> };
            }
        }
    };
};

<span class="hljs-comment">// Example Usage</span>
<span class="hljs-keyword">const</span> firstFiveNumbers = <span class="hljs-keyword">new</span> FirstNNumbers(<span class="hljs-number">5</span>);
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> num <span class="hljs-keyword">of</span> firstFiveNumbers) {
    <span class="hljs-built_in">console</span>.log(num); <span class="hljs-comment">// Outputs: 1, 2, 3, 4, 5</span>
}
</code></pre>
<p>If you're a TypeScript enthusiast, you can certainly implement this in TypeScript as well, reaping the benefits of generics.</p>
<p>For more information, refer to the TypeScript documentation on iterators and generators: <a target="_blank" href="https://www.typescriptlang.org/docs/handbook/iterators-and-generators.html">TypeScript Iterators and Generators</a>.</p>
<h1 id="heading-generator">Generator</h1>
<p>Now that we've explored iterators, let's look into generators. Generator functions in JavaScript allow us to return a sequence that is lazy-loaded or evaluated later. This feature utilizes the <code>yield</code> keyword, enabling us to create a stream that can be paused and resumed. The function definition must include an asterisk (<code>*</code>). Here’s how a generator function looks:</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span>* <span class="hljs-title">numberGenerator</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">let</span> count = <span class="hljs-number">1</span>;
    <span class="hljs-keyword">while</span> (count &lt;= <span class="hljs-number">3</span>) {
        <span class="hljs-keyword">yield</span> count; <span class="hljs-comment">// Pause here and return the current count</span>
        count++;
    }
}

<span class="hljs-comment">// Example Usage</span>
<span class="hljs-keyword">const</span> generator = numberGenerator();

<span class="hljs-built_in">console</span>.log(generator.next()); <span class="hljs-comment">// { value: 1, done: false }: first yield resumes here</span>
<span class="hljs-built_in">console</span>.log(generator.next()); <span class="hljs-comment">// { value: 2, done: false }: second yield</span>
<span class="hljs-built_in">console</span>.log(generator.next()); <span class="hljs-comment">// { value: 3, done: false }: third yield</span>
<span class="hljs-built_in">console</span>.log(generator.next()); <span class="hljs-comment">// { value: undefined, done: true }</span>
</code></pre>
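<p>Worth noting: the object a generator function returns already implements the iteration protocol, so the manual <code>next()</code>/<code>done</code> bookkeeping from the iterator section collapses into a few lines. A sketch (my own rewrite of the earlier <code>FirstNNumbers</code> idea, not code from above):</p>

```javascript
// Generators implement the iteration protocol themselves, so the
// hand-written next()/done object from the iterator section can be
// replaced by a generator method (sketch, my own rewrite):
class FirstNNumbersGen {
  constructor(N) {
    this.N = N;
  }

  // The * marks this method as a generator; yielding makes the
  // instance directly usable in for...of, spread syntax, etc.
  *[Symbol.iterator]() {
    for (let count = 1; count <= this.N; count++) {
      yield count;
    }
  }
}

const firstFive = new FirstNNumbersGen(5);
console.log([...firstFive]); // [1, 2, 3, 4, 5]
```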
<h2 id="heading-async-please">Async Please?</h2>
<p>Generator functions can also handle asynchronous functionalities, making them extremely useful. This allows us to evaluate each sequence item for every yield call, leading to efficient memory usage and performance. Running the following example code using Node.js will demonstrate a slight delay between console prints:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>* <span class="hljs-title">fetchDataGenerator</span>(<span class="hljs-params">urls</span>) </span>{
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> url <span class="hljs-keyword">of</span> urls) {
    <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> fetch(url);
    <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> response.json();
    <span class="hljs-keyword">yield</span> data;
  }
}

<span class="hljs-keyword">const</span> apiUrls = [
  <span class="hljs-string">'https://jsonplaceholder.typicode.com/posts/1'</span>,
  <span class="hljs-string">'https://jsonplaceholder.typicode.com/posts/2'</span>,
  <span class="hljs-string">'https://jsonplaceholder.typicode.com/posts/3'</span>,
];

(<span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-keyword">const</span> dataGenerator = fetchDataGenerator(apiUrls);


  <span class="hljs-keyword">for</span> <span class="hljs-keyword">await</span> (<span class="hljs-keyword">const</span> data <span class="hljs-keyword">of</span> dataGenerator) {
    <span class="hljs-built_in">console</span>.log(data);
  }
})();
</code></pre>
<p>The delay occurs because the <code>yield</code> keyword returns a value as soon as it becomes available, allowing the calling function to print it. During the next iteration, the <code>fetchDataGenerator</code> generator retains its state, knowing exactly where to resume. It then fetches the next result and returns it immediately.</p>
<h1 id="heading-mixing-both">Mixing Both</h1>
<p>The <code>Symbol.iterator</code> can be combined with asynchronous generator functions to create a powerful iteration abstraction within an object. Consider the following code:</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// Dummy method to simulate an API call or other time-consuming task.</span>
<span class="hljs-keyword">const</span> fetchItems = <span class="hljs-keyword">async</span> (baseValue) =&gt; {
  <span class="hljs-keyword">await</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function">(<span class="hljs-params">resolve</span>) =&gt;</span> <span class="hljs-built_in">setTimeout</span>(resolve, <span class="hljs-number">500</span>));
  <span class="hljs-keyword">const</span> dummyArray = [];
  <span class="hljs-keyword">const</span> startIndex = baseValue * <span class="hljs-number">10</span>;
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = baseValue; i &lt; baseValue + <span class="hljs-number">5</span>; i++) {
    dummyArray.push({
      <span class="hljs-attr">prop1</span>: <span class="hljs-string">`Prop1: <span class="hljs-subst">${startIndex}</span>`</span>,
      <span class="hljs-attr">prop2</span>: <span class="hljs-string">`Prop 2:<span class="hljs-subst">${startIndex * <span class="hljs-number">10</span>}</span>`</span>,
    });
  }
  <span class="hljs-keyword">return</span> dummyArray;
};

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ItemEnumerable</span> </span>{
  <span class="hljs-keyword">constructor</span>() {
    <span class="hljs-built_in">this</span>.currentCounter = <span class="hljs-number">1</span>;
  }

  <span class="hljs-keyword">async</span> *[<span class="hljs-built_in">Symbol</span>.asyncIterator]() {
    <span class="hljs-keyword">while</span> (<span class="hljs-literal">true</span>) {
      <span class="hljs-keyword">const</span> users = <span class="hljs-keyword">await</span> fetchItems(<span class="hljs-built_in">this</span>.currentCounter);
      <span class="hljs-keyword">if</span> (users.length === <span class="hljs-number">0</span>) {
        <span class="hljs-keyword">break</span>;
      }
      <span class="hljs-keyword">yield</span>* users;
      <span class="hljs-built_in">this</span>.currentCounter++;

      <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.currentCounter &gt; <span class="hljs-number">3</span>) {
        <span class="hljs-keyword">break</span>;
      }
    }
  }
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">loadItems</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> enumerable = <span class="hljs-keyword">new</span> ItemEnumerable();
  <span class="hljs-keyword">for</span> <span class="hljs-keyword">await</span> (<span class="hljs-keyword">const</span> item <span class="hljs-keyword">of</span> enumerable) {
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Item: <span class="hljs-subst">${item.prop1}</span> &amp; <span class="hljs-subst">${item.prop2}</span>`</span>);
  }
}

loadItems();
</code></pre>
<h3 id="heading-explanation-of-the-code"><strong>Explanation of the Code:</strong></h3>
<p>In the code above:</p>
<ul>
<li><p>The <code>ItemEnumerable</code> class utilises an asynchronous generator in its <code>Symbol.asyncIterator</code> implementation.</p>
</li>
<li><p>The instance maintains an internal counter that is used to simulate API calls.</p>
</li>
<li><p>The <code>fetchItems</code> function simulates fetching five items for each call.</p>
</li>
<li><p>Once the API call returns, those five items are yielded immediately to the calling function (i.e., the for await...of loop) and printed.</p>
</li>
<li><p>The iterator's internal state allows it to perform three iterations, resulting in a total of 15 items returned: five items for each of the three iterations of the internal counter (<code>currentCounter</code>).</p>
</li>
<li><p>When executing this code, you'll notice a slight delay after every five items due to the simulated API call.</p>
</li>
</ul>
<h3 id="heading-output-example"><strong>Output Example:</strong></h3>
<pre><code class="lang-plaintext">Item: Prop1: 10 &amp; Prop 2:100
Item: Prop1: 10 &amp; Prop 2:100
Item: Prop1: 10 &amp; Prop 2:100
Item: Prop1: 10 &amp; Prop 2:100
Item: Prop1: 10 &amp; Prop 2:100
Item: Prop1: 20 &amp; Prop 2:200
Item: Prop1: 20 &amp; Prop 2:200
Item: Prop1: 20 &amp; Prop 2:200
Item: Prop1: 20 &amp; Prop 2:200
Item: Prop1: 20 &amp; Prop 2:200
Item: Prop1: 30 &amp; Prop 2:300
Item: Prop1: 30 &amp; Prop 2:300
Item: Prop1: 30 &amp; Prop 2:300
Item: Prop1: 30 &amp; Prop 2:300
Item: Prop1: 30 &amp; Prop 2:300
</code></pre>
<h3 id="heading-use-cases"><strong>Use Cases:</strong></h3>
<p>Combining <code>Symbol.asyncIterator</code> and async generator functions can be beneficial for various applications, such as:</p>
<ul>
<li><p>Streaming Data: Handling IoT sensor data, social media feeds, log files, etc.</p>
</li>
<li><p>Database Queries: Efficiently fetching and chunking data from databases.</p>
</li>
<li><p>Recursive Data Structures: Flattening complex structures using the combined functionality of iterators and async generators.</p>
</li>
</ul>
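<p>The database-query use case above can be sketched with an async generator that pages through results until the source is exhausted. This is an illustration only: <code>queryPage</code> is a hypothetical stand-in for a real driver call, and the row data is dummy.</p>

```javascript
// Sketch of the "database queries" use case: an async generator that
// pages through a source in fixed-size chunks. queryPage is a
// hypothetical stand-in for a real database driver call.
const ROWS = ['a', 'b', 'c', 'd', 'e'];

async function queryPage(offset, limit) {
  // Simulate an asynchronous round trip to a database.
  await new Promise((resolve) => setTimeout(resolve, 10));
  return ROWS.slice(offset, offset + limit);
}

async function* queryInChunks(pageSize) {
  let offset = 0;
  while (true) {
    const page = await queryPage(offset, pageSize);
    if (page.length === 0) break; // no more rows
    yield* page;                  // stream rows one at a time
    offset += pageSize;
  }
}

(async () => {
  const seen = [];
  for await (const row of queryInChunks(2)) {
    seen.push(row);
  }
  console.log(seen); // ['a', 'b', 'c', 'd', 'e']
})();
```

<p>The caller sees a flat stream of rows while only one page at a time is held in memory, which is the memory benefit discussed earlier.</p>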
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we explored the <code>Symbol.iterator</code> function and asynchronous generator functions. We demonstrated how they provide a fantastic abstraction for iterating over data, both synchronously and asynchronously. Additionally, we highlighted several practical use cases where this pattern can be advantageous.</p>
]]></content:encoded></item><item><title><![CDATA[Demystifying ARP and NAT: The Backbone of Internet Traffic]]></title><description><![CDATA[Introduction
In the first article of this series, Navigating Networking Layers: The Journey of Data Through TCP we explored how data flows through various layers of the network, transitioning from segments to packets to data frames.
In this article, ...]]></description><link>https://oxyprogrammer.com/demystifying-arp-and-nat-the-backbone-of-internet-traffic</link><guid isPermaLink="true">https://oxyprogrammer.com/demystifying-arp-and-nat-the-backbone-of-internet-traffic</guid><category><![CDATA[address routing protocol]]></category><category><![CDATA[address routing]]></category><category><![CDATA[networkingbasics]]></category><category><![CDATA[ARP]]></category><category><![CDATA[nat]]></category><category><![CDATA[networking]]></category><category><![CDATA[internet]]></category><category><![CDATA[Internet Gateway]]></category><category><![CDATA[network address translation]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Tue, 19 Nov 2024 04:30:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731162110610/4713d8a9-bf11-4e21-93f8-a49656bf916a.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>In the first article of this series, <a target="_blank" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp">Navigating Networking Layers: The Journey of Data Through TCP</a> we explored how data flows through various layers of the network, transitioning from segments to packets to data frames.</p>
<p>In this article, we will delve into two essential concepts: <strong>Address Resolution Protocol (ARP)</strong> and <strong>Network Address Translation (NAT)</strong>. These play crucial roles in facilitating communication between hosts within a private network and across the internet, respectively.</p>
<p>While this article stands on its own, readers who choose not to refer to the previous installment will still find valuable insights here. We aim to provide a clear understanding of how computers communicate with one another, whether they are situated within the same premises or on opposite sides of the globe.</p>
<h1 id="heading-arp-address-routing-protocol">ARP (Address Resolution Protocol)</h1>
<p>To begin, let's start with something relatable—our home internet setup.</p>
<p>The following diagram illustrates this setup:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731162160367/3eb271a2-4bcf-4740-86aa-5bc77c923564.png" alt class="image--center mx-auto" /></p>
<p>Many of us may not have encountered a switch, as most Wi-Fi routers also function as switches. However, in the past, when internet connections were primarily provided via cables, all devices had to connect to a switch, which, in turn, connected to a router.</p>
<p>This setup constitutes a private network. As soon as you connect your device to the network, it is assigned an IP address. The CIDR ranges for private networks, as outlined in the RFC 1918 Address Allocation guideline, are as follows:</p>
<ul>
<li><p>192.168.0.0/16</p>
</li>
<li><p>172.16.0.0/12</p>
</li>
<li><p>10.0.0.0/8</p>
</li>
</ul>
<p>For most home users, the first CIDR range is the most common choice, as it provides a sufficient number of IP addresses for typical household usage. The IP address displayed on your device is not inherent to the machine itself; rather, it is an attribute of the <strong>Network Interface Card (NIC)</strong> connected to your device. The NIC also has a unique MAC address, assigned upon manufacture, which follows the format <code>aa:bb:cc:dd:ee:ff</code>. This means a MAC address is made up of six octets.</p>
<p>Both the MAC address of your NIC and the IP address assigned to your NIC serve as identifiers within a private network. It is essential to note that no two devices can share the same IP address or MAC address within the same private network (subnet).</p>
<p>Now, let’s visualize a scenario in which two hosts within a given subnet want to communicate.</p>
<p>Communication occurs using IP addresses; however, within a subnet, MAC addresses are used. This process is facilitated through an Address Resolution Protocol (ARP) table maintained by all hosts. The ARP table acts as a mapping device, linking IP addresses to MAC addresses.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731162189768/73507df3-385c-4cbd-9959-c8af53915f05.png" alt class="image--center mx-auto" /></p>
<p>Note that the IP and MAC addresses in the ARP tables have been abbreviated for brevity. For instance, '1' represents 192.168.0.1 and 'aa' denotes aa:aa:aa:aa:aa:aa.</p>
<p>There are two potential scenarios:</p>
<ol>
<li><p>A host wishes to connect to another host within the subnet.</p>
</li>
<li><p>A host aims to connect to a host outside the subnet, residing elsewhere on the internet.</p>
</li>
</ol>
<p>We will explore both scenarios in detail.</p>
<h2 id="heading-scenario-1-host-wants-to-connect-to-another-host-within-the-subnet">Scenario 1: Host Wants to Connect to Another Host within the Subnet</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731162221406/0c7356c2-670c-493a-a42e-e20aba24a14a.png" alt class="image--center mx-auto" /></p>
<p>Starting with a fresh slate, the ARP tables of all hosts will initially contain only their own IP-MAC mappings. Let’s assume Host 1 wants to connect to Host 4. All Host 1 knows is the IP address of Host 4, which is 192.168.0.4. The following steps occur:</p>
<ol>
<li><p><strong>Determine Subnet Membership</strong>: Host 1 checks whether the target IP address (192.168.0.4) belongs to the same subnet (the method for this determination will be discussed in a later section).</p>
</li>
<li><p><strong>ARP Table Lookup</strong>: Host 1 consults its ARP table but does not find an entry for Host 4.</p>
</li>
<li><p><strong>Broadcast ARP Request</strong>: To locate Host 4's MAC address, Host 1 broadcasts an ARP request to all hosts within the subnet connected via the router/switch. For simplicity, we assume that the router functions as a switch in this scenario.</p>
</li>
</ol>
<p>From the broadcast message, Host 4 learns the MAC address of Host 1. All hosts receive the broadcast, but only Host 4 will respond.</p>
<p>Host 1 then updates its ARP table and uses Host 4's MAC address to send messages.</p>
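<p>The lookup-then-broadcast flow above can be sketched as a toy model. This is purely illustrative: the "network" is a plain object mapping IPs to MACs, whereas in reality the reply comes from the target host itself.</p>

```javascript
// Toy model of ARP resolution: check the local cache first, fall back
// to a "broadcast" (here: a lookup table standing in for the subnet).
const network = { '192.168.0.4': 'dd:dd:dd:dd:dd:dd' }; // who answers broadcasts
const arpTable = { '192.168.0.1': 'aa:aa:aa:aa:aa:aa' }; // host's own entry

function resolveMac(ip) {
  if (arpTable[ip]) return arpTable[ip]; // step 2: ARP table lookup
  const mac = network[ip];               // step 3: broadcast; target replies
  if (mac) arpTable[ip] = mac;           // cache the learned mapping
  return mac;
}

console.log(resolveMac('192.168.0.4')); // learned via the "broadcast"
console.log(arpTable);                  // cache now contains the new entry
```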
<h2 id="heading-how-does-a-host-determine-if-another-host-is-within-the-same-subnet">How does a host determine if another host is within the same subnet?</h2>
<p>A host determines if another host's IP address is within the same subnet by performing the following steps:</p>
<ol>
<li><p>Obtain the Subnet Mask: The host uses its configured subnet mask to identify the network portion and the host portion of its own IP address.</p>
</li>
<li><p>Perform a Bitwise AND Operation: The host executes a bitwise AND operation between its own IP address and its subnet mask, and then does the same for the target IP address.</p>
<ul>
<li><p>Example:</p>
<ul>
<li><p>Host's IP: <code>192.168.0.10</code></p>
</li>
<li><p>Subnet Mask: <code>255.255.0.0</code></p>
</li>
<li><p>Target IP: <code>192.168.0.5</code></p>
</li>
</ul>
</li>
<li><p>Convert to binary:</p>
<ul>
<li><p>Host IP: <code>11000000.10101000.00000000.00001010</code></p>
</li>
<li><p>Subnet Mask: <code>11111111.11111111.00000000.00000000</code></p>
</li>
<li><p>Target IP: <code>11000000.10101000.00000000.00000101</code></p>
</li>
</ul>
</li>
<li><p>Perform bitwise AND:</p>
<ul>
<li><p>Host: <code>11000000.10101000.00000000.00000000</code>  (result: <code>192.168.0.0</code>)</p>
</li>
<li><p>Target: <code>11000000.10101000.00000000.00000000</code> (result: <code>192.168.0.0</code>)</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Compare Results: If the results of the AND operation for both IP addresses are identical, then both hosts are in the same subnet. In the above example, since both results are <code>192.168.0.0</code>, the host determines that <code>192.168.0.5</code> is in the same subnet.</p>
</li>
</ol>
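<p>The three steps above can be expressed compactly in code. A minimal sketch (JavaScript, to match the rest of the blog; <code>ipToInt</code> and <code>sameSubnet</code> are illustrative helper names, not standard APIs):</p>

```javascript
// Subnet membership check: convert dotted-quad strings to 32-bit
// integers, mask both addresses, and compare the network portions.
function ipToInt(ip) {
  // >>> 0 keeps the accumulator an unsigned 32-bit integer.
  return ip.split('.').reduce((acc, octet) => ((acc << 8) + Number(octet)) >>> 0, 0);
}

function sameSubnet(hostIp, targetIp, mask) {
  const m = ipToInt(mask);
  // Bitwise AND with the mask zeroes out the host portion.
  return ((ipToInt(hostIp) & m) >>> 0) === ((ipToInt(targetIp) & m) >>> 0);
}

console.log(sameSubnet('192.168.0.10', '192.168.0.5', '255.255.0.0'));  // true
console.log(sameSubnet('192.168.0.10', '153.178.67.7', '255.255.0.0')); // false
```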
<h2 id="heading-scenario-2-host-wants-to-connect-to-a-host-who-is-outside-the-subnet-and-is-located-somewhere-on-the-internet">Scenario 2: Host wants to connect to a host who is outside the subnet and is located somewhere on the internet</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731162363655/df11c78f-041d-475b-b3db-e3eec9e74bc0.png" alt class="image--center mx-auto" /></p>
<p>Now, let’s examine the scenario in which a host wants to connect to an IP address located somewhere on the internet.</p>
<p>Assuming Host 1 intends to connect to a server with the IP address <code>153.178.67.7</code>, the following steps occur:</p>
<ol>
<li><p><strong>Determine Subnet Membership</strong>: Host 1 checks whether the target IP address (<code>153.178.67.7</code>) is part of its subnet.</p>
</li>
<li><p><strong>Realize It’s Outside the Network</strong>: Upon realizing that the target IP is not within the same network, Host 1 prepares to send the message to the gateway.</p>
</li>
<li><p><strong>Identify the Gateway</strong>: The gateway serves as a special host acting as a bridge to an external network. Host 1 will now consult its ARP table to find the MAC address of the gateway.</p>
</li>
<li><p><strong>ARP Broadcast</strong>: In the previous diagram, the gateway is also represented by the router. Host 1 will issue an ARP broadcast to request the MAC address of the gateway (represented as <code>ee:ee:ee:ee:ee:ee</code>). It is important to note that this step is where <strong><mark>ARP poisoning</mark></strong> can potentially occur, a topic we will cover shortly.</p>
</li>
<li><p><strong>Receive the Gateway’s MAC Address</strong>: The router replies with its MAC address, allowing Host 1 to send the message to the gateway.</p>
</li>
</ol>
<p>At this juncture, you may be wondering why a request for an external IP address is sent to the gateway. We will explore this topic in detail in the upcoming NAT section.</p>
<h2 id="heading-arp-poisoning">ARP Poisoning</h2>
<p>When a connected host broadcasts a message to the gateway, it is possible for a malicious host to respond to the broadcast, impersonating the gateway. If the reply from the malicious host reaches the requesting host before the legitimate gateway's response, the requesting host may mistakenly identify the malicious host as the gateway. Consequently, the requesting host will direct all internet requests to this impersonating host. This process is known as <strong>ARP poisoning</strong>.</p>
<h1 id="heading-nat-network-address-translation">NAT (Network Address Translation)</h1>
<p>Internet gateways or routers that connect the Wide Area Network (WAN) and Local Area Network (LAN) possess two IP addresses: one internal IP and one public IP.</p>
<p>In the previous section, we observed that when a host needs to send a message to another host on the internet, it forwards the message to the internet gateway.</p>
<p>The Internet Gateway (IG) recognizes that the message must be routed to a remote server on the internet. It reads the From IP (<code>192.168.0.1</code>) from the IP packet header and the source Port (<code>6666</code>) from the TCP segment. Thus, internet gateways operate at Layer 3 of the networking model (and at Layer 4 when performing port translation).</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">👉</div>
<div data-node-type="callout-text">This was discussed in <a target="_self" href="https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp">Navigating Networking Layers: The Journey of Data Through TCP</a>.</div>
</div>

<p>The IG makes an entry in its NAT table, which includes a public port that corresponds to the combination of the From IP and Port as well as the To IP and Port.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731162590533/0ac8859d-7f2c-4b25-b04f-7471bef72d42.png" alt class="image--center mx-auto" /></p>
<p>Referring to the diagram above, the NAT table of the router contains entries for two different hosts that communicate with the same application on the same server. The IG rewrites the headers of the IP packets, replacing the From IP and Port with its own public IP and an allocated public port. As a result, the packets sent to the remote server now have a From IP and Port of <code>19.18.5.1:7777</code>.</p>
<p>When the response from the remote server returns to the internet gateway, it arrives at port 7777. The IG checks its NAT table and recognizes that traffic on port <code>7777</code> should be redirected to the local machine at <code>192.168.0.1</code>, port <code>6666</code>.</p>
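<p>The port-mapping behaviour described above can be sketched as a toy NAT table. This is an illustration only: <code>translateOutbound</code> and <code>translateInbound</code> are hypothetical names, and a real gateway also tracks the destination endpoint, protocol, and entry expiry.</p>

```javascript
// Toy NAT table: outbound packets get a fresh public port; replies
// arriving on that port are mapped back to the private endpoint.
const natTable = new Map(); // publicPort -> { ip, port } of the local host
let nextPublicPort = 7777;

function translateOutbound(fromIp, fromPort) {
  const publicPort = nextPublicPort++;
  natTable.set(publicPort, { ip: fromIp, port: fromPort });
  return { ip: '19.18.5.1', port: publicPort }; // gateway's public IP
}

function translateInbound(publicPort) {
  return natTable.get(publicPort); // original private endpoint, or undefined
}

const rewritten = translateOutbound('192.168.0.1', 6666);
console.log(rewritten);                        // { ip: '19.18.5.1', port: 7777 }
console.log(translateInbound(rewritten.port)); // { ip: '192.168.0.1', port: 6666 }
```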
<p>Interestingly, API gateways, which serve as entry points to a microservices cluster, perform a conceptually similar translation. They route requests to the appropriate microservice based on the incoming request, presenting only one gateway to the caller rather than multiple services.</p>
<p><strong>Why Can’t All Machines Be Assigned a Public IP to Avoid Internet Gateways?</strong></p>
<p>This limitation is precisely what IPv6 addresses. IPv4's 32-bit address space cannot provide a unique public address for every connected device. NAT provides a workaround, allowing many devices to connect to the internet while utilizing a limited number of public IP addresses.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we have explored two fundamental protocols—Address Resolution Protocol (ARP) and Network Address Translation (NAT)—that play critical roles in networking, facilitating seamless communication within private networks and between those networks and the internet.</p>
<p>We began with an overview of how ARP enables hosts within a subnet to resolve IP addresses into MAC addresses, thus allowing devices to identify and communicate with one another effectively. By maintaining an ARP table, each host can efficiently map IP addresses to their corresponding MAC addresses, ensuring reliable connectivity within a private network.</p>
<p>We then delved into NAT, which serves as a vital mechanism employed by internet gateways to allow multiple devices on a private network to share a single public IP address. This process not only conserves the limited number of IPv4 addresses available but also enhances security by hiding internal IP addresses from external networks. NAT enables the correct routing of data packets between local machines and remote servers, ensuring that responses from the internet are directed to the appropriate device within the private network.</p>
<p>Together, ARP and NAT form a foundation that underpins modern network communication. Understanding these protocols is essential for appreciating how data flows seamlessly across various layers of the network, whether between devices within the same premises or on opposite sides of the globe. As we continue to move toward an increasingly interconnected world, the importance of these protocols will only grow, paving the way for advancements in networking technology.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding CQRS and Event Sourcing: A Path to More Robust Distributed Systems]]></title><description><![CDATA[Introduction
Distributed systems are all the rage these days! With system design gaining traction in the software world, everyone is eager to learn and implement distributed systems. They (Distributed Systems) undoubtedly have their merits and provid...]]></description><link>https://oxyprogrammer.com/understanding-cqrs-and-event-sourcing-a-path-to-more-robust-distributed-systems</link><guid isPermaLink="true">https://oxyprogrammer.com/understanding-cqrs-and-event-sourcing-a-path-to-more-robust-distributed-systems</guid><category><![CDATA[#CQRS]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[C#]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[MongoDB]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[kafka]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[Event Sourcing]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Mon, 18 Nov 2024 04:30:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731139050090/5ea53d9d-ea59-4226-ad40-b168b9299de4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Distributed systems are all the rage these days! With system design gaining traction in the software world, everyone is eager to learn and implement them. Distributed systems undoubtedly have their merits and provide viable solutions to many of the challenges modern software faces. However, they come with a few caveats.</p>
<p>While numerous tools and products are offered by most cloud providers to help tackle these challenges, it’s essential for developers to be aware of the hurdles. After all, orchestrating independent, separate pieces of software to function seamlessly as one giant unit is no easy task. But don't worry; we’ll explore this together!</p>
<p>In this article, we'll focus on a well-established distributed pattern known as Command Query Responsibility Segregation (CQRS) and combine it with Event Sourcing to make it even more relevant. Please remember that these are two distinct patterns that, while often used together, remain independent of one another.</p>
<p>I’ve kept this article approachable and jargon-light, so let’s settle in for some quality time! Grab a coffee (or your favorite drink) as we dive into the design and implementation of a complete CQRS-Event Sourcing system with tools like Kafka, MongoDB, and PostgreSQL, and for added flair, we’ll build a Next.js frontend. Though we’ll be working in .NET, the concepts are relevant and useful for any backend developer.</p>
<h1 id="heading-cqrs">CQRS</h1>
<p>Now that we know CQRS stands for Command Query Responsibility Segregation, let's take a quick look at what it is and why it’s beneficial.</p>
<h2 id="heading-what-is-cqrs-and-why-do-we-need-it">What is CQRS and Why Do We Need It?</h2>
<p>CQRS is a design pattern that separates the read operations of our domain entities from the write operations. This separation enables us to scale reads independently of writes. If you’ve been exploring distributed system designs for a while, you may already be familiar with the famous Tiny URL system design question. It states that once a URL is generated (written), it is read at least 100 times. Thus, the read-to-write ratio for a domain entity (the tiny URL, in this case) is 100:1.</p>
<p>Doesn’t it make sense to separate the read and write functionalities so that we can scale the reading independently while maintaining lower infrastructure costs for writing? This is a classic case for CQRS, and trust me, a lot of the software we build today revolves around this concept.</p>
<p>Let’s illustrate this with a straightforward example: customers and orders.</p>
<p>Info: A customer can be created, and a customer can place many orders.</p>
<p>If I were to design the database schema for storing this data, I would create the following two tables:</p>
<pre><code class="lang-plaintext">Customers Table:
+--------------+-----------------+----------------+----------+
| CustomerId   | Name            | Email          | Phone    |
+--------------+-----------------+----------------+----------+
| 1            | Alice Smith     | alice@email.com| 123-4567 |
| 2            | Bob Johnson     | bob@email.com  | 234-5678 |
| 3            | Carol Davis     | carol@email.com| 345-6789 |
+--------------+-----------------+----------------+----------+

Orders Table:
+----------+--------------+--------------------+--------+
| OrderId  | CustomerId   | OrderDate          | Amount |
+----------+--------------+--------------------+--------+
| 101      | 1            | 2024-05-01         | 150.00 |
| 102      | 1            | 2024-05-03         | 200.00 |
| 103      | 2            | 2024-05-02         | 300.00 |
| 104      | 3            | 2024-05-04         | 120.00 |
| 105      | 1            | 2024-05-05         | 50.00  |
+----------+--------------+--------------------+--------+
</code></pre>
<p>In this schema, the <code>CustomerId</code> from the Customers table serves as a foreign key in the Orders table.</p>
<p>For the requirements listed in the left column of the following table, we will perform the corresponding operations mentioned in the right column:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Requirement</strong></td><td><strong>Operation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Add a customer</td><td>Add the entry in the Customers Table</td></tr>
<tr>
<td>A customer places an order</td><td>Add entry in the Orders Table with the Customer ID</td></tr>
<tr>
<td>Generate reports for purchases between specified dates</td><td>Join the Orders table with the Customers table to obtain information</td></tr>
<tr>
<td>Generate reports for daily purchases</td><td>Join the Orders table with the Customers table to obtain information</td></tr>
</tbody>
</table>
</div><p>This approach works well for a small store application with a limited customer base.</p>
<p>But what happens when we move to larger stores that cater to countless customers and want to leverage purchase orders to uncover meaningful patterns? These larger enterprises will require various analytical tools to gather data for multiple analyses. Here are a few examples:</p>
<ul>
<li><p>What were the total sales for each month over the last year, categorized by customer segment? [<em>Sales Performance</em>]</p>
</li>
<li><p>Which customer segments have the highest purchase amounts and frequency of orders? [<em>Customer Segmentation</em>]</p>
</li>
<li><p>What are the top five products sold by order count and total revenue for the last quarter? [<em>Product Performance</em>]</p>
</li>
<li><p>What percentage of customers made repeat purchases in the last six months? [<em>Customer Retention</em>]</p>
</li>
<li><p>What is the average time taken to fulfill orders for each customer segment? [<em>Supply Chain Assessment</em>]</p>
</li>
<li><p>How did customer orders change during and after the last promotional campaign? [<em>Marketing Analysis</em>]</p>
</li>
<li><p>What revenue is generated from different regions, and how does it compare to customer demographics? [<em>Geographic Analysis</em>]</p>
</li>
</ul>
<p>There can be numerous use cases requiring data from the database above. Some of these services will also need real-time data. Just imagine the number of joins your database would have to perform!</p>
<p>So, we have a problem. Our database is normalized, making it well-suited for OLTP (Online Transaction Processing). However, this database would struggle under the demands of our analytical processes.</p>
<h3 id="heading-solution">Solution</h3>
<p>The solution is to break operations into two distinct flows and utilize two databases. These databases will have replicated data but distinct schemas. While write operations can continue feeding data into the previous schema (OLTP), we can have another database designed with a read-friendly schema for the OLAP (Online Analytical Processing) flow.</p>
<p>Thus, the same data captured in the earlier case would now be organized like this:</p>
<pre><code class="lang-plaintext">CustomerOrders Table
+--------------+-----------------+----------------+----------+------------+--------+
| CustomerId   | Name            | Email          | OrderId  | OrderDate  | Amount |
+--------------+-----------------+----------------+----------+------------+--------+
| 1            | Alice Smith     | alice@email.com| 101      | 2024-05-01 | 150.00 |
| 1            | Alice Smith     | alice@email.com| 102      | 2024-05-03 | 200.00 |
| 2            | Bob Johnson     | bob@email.com  | 103      | 2024-05-02 | 300.00 |
| 3            | Carol Davis     | carol@email.com| 104      | 2024-05-04 | 120.00 |
| 1            | Alice Smith     | alice@email.com| 105      | 2024-05-05 | 50.00  |
+--------------+-----------------+----------------+----------+------------+--------+
</code></pre>
<p>Notice that while the above table schema is not normalized, it eliminates the need for joins on read operations!</p>
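<p>To make the idea concrete, here is a minimal sketch of such a projection, using a subset of the example rows. It is written in Python purely for brevity (the project discussed later is in C#), and the table and field names simply mirror the example schema above.</p>

```python
# Sketch: projecting the normalized write model (Customers + Orders)
# into the denormalized CustomerOrders read model shown above.
# Python is used here for brevity; the actual project is in C#.

customers = {
    1: {"Name": "Alice Smith", "Email": "alice@email.com"},
    2: {"Name": "Bob Johnson", "Email": "bob@email.com"},
    3: {"Name": "Carol Davis", "Email": "carol@email.com"},
}

orders = [
    {"OrderId": 101, "CustomerId": 1, "OrderDate": "2024-05-01", "Amount": 150.00},
    {"OrderId": 102, "CustomerId": 1, "OrderDate": "2024-05-03", "Amount": 200.00},
    {"OrderId": 103, "CustomerId": 2, "OrderDate": "2024-05-02", "Amount": 300.00},
]

def project_customer_orders(customers, orders):
    """Flatten each order together with its customer's details: the read
    model pays the join cost once, at write time, instead of on every read."""
    rows = []
    for order in orders:
        customer = customers[order["CustomerId"]]
        rows.append({**customer, **order})
    return rows

read_model = project_customer_orders(customers, orders)
# Every row is now self-contained: no join is needed to serve a report query.
```

<p>Each read-model row pays the join cost once, at write time, so the analytical queries listed earlier become simple scans and filters.</p>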
<p>Here is a rough diagram of what we discussed so far:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731134504146/41674613-426d-419d-9986-9cb0a589e987.png" alt class="image--center mx-auto" /></p>
<p>You might argue that storing duplicate data is wasteful, but I would counter that data storage has never been as cost-effective as it is today! Investing in extra storage saves a great deal of computational costs and minimizes the risk of database failure.</p>
<h2 id="heading-pros-and-cons-of-cqrs">Pros and Cons of CQRS</h2>
<p>Pros:</p>
<ul>
<li><p>Separation of concerns: distinct read and write models let both flows evolve independently.</p>
</li>
<li><p>Improved scalability, since each side can be scaled to match its own load.</p>
</li>
<li><p>Improved security: since reads and writes are separate flows, it is easy to apply selective security to each.</p>
</li>
</ul>
<p>Cons:</p>
<ul>
<li><p>Increased complexity.</p>
</li>
<li><p>Data consistency issues, as the read and write databases must always be kept in sync.</p>
</li>
<li><p>Data duplication. (But we already countered that, didn’t we?)</p>
</li>
<li><p>Overkill for small systems, where the added complexity outweighs the benefits.</p>
</li>
</ul>
<p>By the way, if you’ve followed along, you now have a solid understanding of CQRS! <strong>We have successfully segregated the responsibilities of Command (Write) and Query (Read). How cool is that!</strong></p>
<h1 id="heading-event-sourcing">Event Sourcing</h1>
<p>Let's start with a clean slate! Understanding Event Sourcing is straightforward if you are already familiar with the State Pattern. The State Pattern is a behavioral design pattern that maintains the current state of an entity based on the actions taken on it.</p>
<p>In Event Sourcing, whenever we perform an action on any object, we essentially raise an event for that object. This pattern revolves around storing these events in what we call an event store. The final state of that object is then the aggregate of all those events.</p>
<h3 id="heading-example-bank-account"><strong>Example: Bank Account</strong></h3>
<p>Let's illustrate this with an example: imagine a bank account as an entity that supports the following events:</p>
<ul>
<li><p><strong>Created</strong>: The account is opened when you walk into the bank and create it, receiving an account ID.</p>
</li>
<li><p><strong>Deposit</strong>: You deposit money into the account.</p>
</li>
<li><p><strong>Withdraw</strong>: You withdraw money from the account.</p>
</li>
<li><p><strong>Closed</strong>: You close the account.</p>
</li>
</ul>
<p>Consider the following time series of events over the account's lifespan, which spans a year:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731134755119/04abcf10-35c4-4518-ae59-874abf7011ef.png" alt class="image--center mx-auto" /></p>
<p>Clearly, various events are recorded throughout the life of the account. Now, if storing an updated balance seems simpler, why not just maintain the balance directly?</p>
<p>When a customer closes the account and asks for the available balance (say $300), one might think the bank could simply store that single number. However, that’s not how banking works! Banks, and many other systems, require a log of all transactions to ensure complete transparency from creation to closure. This audit trail is essential for both customers and banks.</p>
<p>The bank records all transactions on the account for audit purposes and can aggregate them to derive the final balance at any given moment. For example, if I query the banking system to provide the balance as of May 28th 2023, the system will replay all the events from creation:</p>
<ul>
<li><p>Starting balance: $0</p>
</li>
<li><p>+$500 (Deposit)</p>
</li>
<li><p>-$200 (Withdrawal)</p>
</li>
<li><p>-$100 (Withdrawal)</p>
</li>
<li><p>+$100 (Deposit) = $300.</p>
</li>
</ul>
<p>In summary, Event Sourcing involves recording events for an entity in an event store. Whenever the state of the object is needed for a specific point in time, all relevant events can be replayed from the beginning to arrive at the entity's final state.</p>
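<p>The replay described above is just a fold over the event stream. Here is a minimal illustrative sketch in Python (the event names follow the bank-account example; the project discussed later is in C#):</p>

```python
# Sketch: deriving a bank account's balance by replaying its events.
# Dates are ISO-formatted strings, so plain string comparison orders them.

def replay(events, as_of=None):
    """Fold the event stream into a balance.
    `as_of` limits the replay to events up to a given date, which is how
    a point-in-time balance (e.g. "as of May 28th, 2023") is obtained."""
    balance = 0
    for event in events:
        if as_of is not None and event["date"] > as_of:
            break
        if event["type"] == "Deposit":
            balance += event["amount"]
        elif event["type"] == "Withdraw":
            balance -= event["amount"]
        # "Created" and "Closed" do not change the balance.
    return balance

account_events = [
    {"type": "Created",  "date": "2023-01-10", "amount": 0},
    {"type": "Deposit",  "date": "2023-02-01", "amount": 500},
    {"type": "Withdraw", "date": "2023-03-15", "amount": 200},
    {"type": "Withdraw", "date": "2023-04-20", "amount": 100},
    {"type": "Deposit",  "date": "2023-05-25", "amount": 100},
]

balance = replay(account_events, as_of="2023-05-28")  # -> 300
```

<p>An aggregate class would wrap exactly this logic: hold the aggregate ID, append new events, and expose the state as the result of the replay.</p>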
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731134789587/8f227616-a597-4ac1-b5c2-c9b9d31aa6b6.png" alt class="image--center mx-auto" /></p>
<p>Now, let’s reconsider the definition of the entity. If the entity’s state is an aggregate of the events, why not refer to it as an aggregate? Indeed, we can redefine a bank account as an aggregate that has an ID (aggregate ID, also known as account ID) and contains a collection of all its events (transactions). The state (balance) can then be derived by replaying all the events associated with the aggregate.</p>
<p>If you follow along, congratulations—you now understand Event Sourcing!</p>
<h2 id="heading-pros-and-cons-of-event-sourcing"><strong>Pros and Cons of Event Sourcing</strong></h2>
<p>Like any architectural pattern, Event Sourcing has its pros and cons:</p>
<p>Pros:</p>
<ul>
<li><p><strong>History and Auditability</strong>: All changes are recorded, providing a complete history for auditing purposes.</p>
</li>
<li><p><strong>Temporal Query Support</strong>: Enables historical analysis by allowing you to query states at specific points in time.</p>
</li>
<li><p><strong>Analytical Data</strong>: Captures not only the data but also the transitions and history, making it suitable for analytics.</p>
</li>
</ul>
<p>Cons:</p>
<ul>
<li><p><strong>Storage Requirements</strong>: A large volume of events may accumulate, even for relatively simple entities, leading to increased storage needs.</p>
</li>
<li><p><strong>Complex State Retrieval</strong>: There’s no quick way to access the current state of an entity without aggregating all the events, which can be time-consuming.</p>
</li>
</ul>
<p>By considering these aspects, one can make informed decisions about whether to implement Event Sourcing in a given system.</p>
<h1 id="heading-combining-cqrs-and-event-sourcing">Combining CQRS and Event Sourcing</h1>
<p>Now that we understand what CQRS and Event Sourcing are, you may already be starting to see why and how they are often combined.</p>
<p>Event Sourcing preserves the full history of our data and enhances auditability, but it comes with slower reads. CQRS, on the other hand, allows us to separate our read and write flows.</p>
<p>By event-sourcing the write flow and serving aggregated, final-state data from the read flow, the two patterns let us store granular history and scale reads efficiently at the same time! This synergy typically operates within an event-driven architecture. Here is a complete diagram that ties everything together and fills in the missing pieces!</p>
<h2 id="heading-flow-overview">Flow Overview</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731134933709/84adc4da-7095-427b-b148-6cd089918fea.png" alt class="image--center mx-auto" /></p>
<p>Let’s iron out the flow for clarity:</p>
<ol>
<li><p><strong>The Write Service Receives an Event</strong>: This service is responsible for handling incoming commands and generating events.</p>
</li>
<li><p><strong>Writing to the Event Source</strong>: The write service records the event in the event store, which is always append-only, meaning there are no delete or update operations.</p>
</li>
<li><p><strong>Publishing to the Event Bus</strong>: The write service then publishes a projected event to the event bus. Note that this is a projection of the received event, not the exact event itself.</p>
</li>
<li><p><strong>Listening for Events</strong>: The read service, which listens to the event bus, receives the projected event.</p>
</li>
<li><p><strong>Updating the Read Database</strong>: The read service processes the projected event and makes the necessary updates in the read database, as this database reflects the final state of the entity.</p>
</li>
</ol>
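<p>The five steps above can be sketched with in-memory stand-ins for the real infrastructure: a list for the append-only event store, a queue for the event bus, and a dictionary for the read database. Python is used for brevity, and all names here are illustrative rather than taken from any actual codebase.</p>

```python
# Sketch of the five-step flow, with in-memory stand-ins for the
# real infrastructure components.

from collections import deque

event_store = []          # append-only: no updates, no deletes
event_bus = deque()       # stands in for the event bus (e.g. Kafka)
read_db = {}              # stands in for the read database

def handle_command(aggregate_id, event):
    """Steps 1-3: record the event, then publish a projection of it."""
    event_store.append((aggregate_id, event))         # step 2: append-only write
    event_bus.append({"id": aggregate_id, **event})   # step 3: projected event

def consume():
    """Steps 4-5: drain the bus and update the read model."""
    while event_bus:
        projected = event_bus.popleft()
        read_db[projected["id"]] = projected          # final state per entity

handle_command("order-101", {"status": "Placed", "amount": 150.00})
handle_command("order-101", {"status": "Shipped", "amount": 150.00})
consume()
# read_db now holds only the latest state; event_store keeps the full history.
```

<p>Notice how the two stores diverge naturally: the event store grows forever, while the read database keeps one row per entity.</p>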
<h2 id="heading-pros-and-cons">Pros and Cons</h2>
<p>Pros:</p>
<ul>
<li><p><strong>Scalability</strong>: The architecture allows for independent scaling of the read and write flows.</p>
</li>
<li><p><strong>Access to Historical Data</strong>: You can access historical data while achieving fast reads since the read service stores the final state.</p>
</li>
<li><p><strong>Separation of Reading and Writing</strong>: The distinct paths for reading and writing enhance clarity and manageability in the system.</p>
</li>
</ul>
<p>Cons:</p>
<ul>
<li><p><strong>Consistency Issues</strong>: Consistency could become a problem if something goes wrong. For example, if the event bus goes down or the read service is not operational while the write service is publishing events, there may be discrepancies.</p>
</li>
<li><p><strong>Eventual Consistency</strong>: Data consistency is not immediate; it is eventual. This is due to the delay between when the write service receives the event and when the read service updates the read database.</p>
</li>
</ul>
<p>By understanding these dynamics, we can effectively leverage CQRS and Event Sourcing to create low-latency, scalable applications that meet the demands of many large-scale systems.</p>
<h1 id="heading-design-of-cqrs-plus">Design of CQRS Plus</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731135019496/43b22de7-b423-4938-b3cd-eae9d21ee03b.png" alt class="image--center mx-auto" /></p>
<p>In my attempt to learn CQRS with event sourcing, I started by developing a couple of .NET Web API applications: one for reading entities and another for writing events on them. Gradually, these applications evolved into a labor of love, leading me to integrate a gateway using YARP and a UI with Next.js, resulting in a comprehensive codebase.</p>
<p>Here is the repository:</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">👉</div>
<div data-node-type="callout-text"><a target="_self" href="https://github.com/OxyProgrammer/cqrs-plus/">https://github.com/OxyProgrammer/cqrs-plus/</a></div>
</div>

<p>The repository's README file contains all the instructions you need to try it out on your end. Note that Docker Desktop is required to run the entire application.</p>
<h2 id="heading-tech-stack">Tech Stack</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731135060744/4a4bb5c4-d4f9-4eca-b327-3b38ce1b21f7.png" alt class="image--center mx-auto" /></p>
<p>The services are written in C#.</p>
<ul>
<li><p>Kafka is used as the event bus for its high availability.</p>
</li>
<li><p>PostgreSQL is utilized for the query service (reading entities) to provide the final state, leveraging its ACID capabilities as a robust RDBMS.</p>
</li>
<li><p>MongoDB serves as the event store because it scales horizontally, accommodating the unbounded growth an event store accumulates.</p>
</li>
<li><p>The UI is built with Next.js using TypeScript because I enjoy working with React!</p>
</li>
<li><p>The entire application is hosted using Docker and Docker Compose, as nothing is better suited for running a distributed application on a local machine.</p>
</li>
</ul>
<h2 id="heading-design">Design</h2>
<p>If you have read the article this far, the design of CQRS Plus should be straightforward to understand:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731135132739/8727781c-c627-421e-97ec-34f71dbb7fe1.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-command-service">Command Service</h3>
<ul>
<li><p>The controllers, upon receiving requests, raise a command to the command handler.</p>
</li>
<li><p>Command handlers check the command and call the appropriate API on the event sourcing handler.</p>
</li>
<li><p>The event sourcing handler interacts with the MongoDB event store, saving an event for an aggregate and signaling the event producer to publish an event.</p>
</li>
<li><p>The event producer, an abstraction over the Kafka event bus, is responsible for placing an event in the Kafka stream.</p>
</li>
</ul>
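<p>The routing from controller to command handler can be sketched as a dispatcher that maps command types to handlers. This is an illustrative Python sketch (the actual service is in C#), and the class and handler names here are hypothetical, not the project's identifiers:</p>

```python
# Sketch: a command dispatcher that routes each command to the single
# handler registered for its type.

class CommandDispatcher:
    def __init__(self):
        self._handlers = {}

    def register(self, command_type, handler):
        # One handler per command type keeps routing unambiguous.
        self._handlers[command_type] = handler

    def send(self, command):
        handler = self._handlers.get(type(command).__name__)
        if handler is None:
            raise ValueError(f"No handler for {type(command).__name__}")
        return handler(command)

class CreateOrderCommand:
    def __init__(self, order_id, amount):
        self.order_id = order_id
        self.amount = amount

def handle_create_order(cmd):
    # In the real service this would call the event sourcing handler,
    # which saves the event to the event store and signals the producer.
    return {"event": "OrderCreated", "order_id": cmd.order_id, "amount": cmd.amount}

dispatcher = CommandDispatcher()
dispatcher.register("CreateOrderCommand", handle_create_order)
event = dispatcher.send(CreateOrderCommand("order-1", 99.0))
```

<p>The query side follows the same shape, with query types mapped to query handlers instead.</p>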
<h3 id="heading-query-service">Query Service</h3>
<ul>
<li><p>The Query Service includes a listener called <code>EventConsumer</code>, thanks to .NET's Worker Services and <code>AddHostedService</code>. It pulls events from a Kafka topic.</p>
</li>
<li><p>Upon receiving an event, the <code>EventConsumer</code> invokes the appropriate handler in the event handler, which then writes data to PostgreSQL.</p>
</li>
<li><p>The query service handles read requests by delegating queries to the <code>QueryDispatcher</code> (similar to the <code>CommandDispatcher</code> in the command service).</p>
</li>
<li><p>The <code>QueryDispatcher</code> invokes the appropriate handler in the query handler, which reads data from PostgreSQL and serves the requested data.</p>
</li>
</ul>
<h1 id="heading-asp-nethttpaspnet-caveats">ASP.NET Caveats</h1>
<h2 id="heading-dtos-events-and-entities">DTOs, Events and Entities</h2>
<p>The entities carrying data within our applications should not be the same as those transmitting data externally. Therefore, request/response DTOs are often converted from/to data entities responsible for interacting with databases.</p>
<p>Events are models that transmit data across services through an event bus.</p>
<p>Although all are plain C# classes, they serve very different purposes, justifying their categorization.</p>
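<p>As a rough sketch of these three roles (the field names are hypothetical, and Python dataclasses stand in for the plain C# classes):</p>

```python
# Sketch: three plain classes with the three distinct roles described
# above. Field names are illustrative only.

from dataclasses import dataclass

@dataclass
class OrderDto:
    """Crosses the service boundary in requests/responses;
    exposes only what the caller should see."""
    order_id: str
    amount: float

@dataclass
class OrderEntity:
    """Maps to the database; may carry internal fields
    the caller never sees."""
    order_id: str
    amount: float
    internal_audit_id: int

@dataclass
class OrderPlacedEvent:
    """Travels across services on the event bus; named after
    what happened, not what it is."""
    order_id: str
    amount: float
    occurred_at: str

def to_dto(entity: OrderEntity) -> OrderDto:
    # Mapping at the boundary drops internal fields.
    return OrderDto(order_id=entity.order_id, amount=entity.amount)
```
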
<h2 id="heading-open-api-support">Open API Support</h2>
<p>CQRS Plus adheres to the OpenAPI standard for defining request routes. For example, if you want to read a comment with a comment ID <code>cid</code> under a post with a post ID <code>pid</code>, the request route would be: <code>/api/v1/posts/{pid}/comments/{cid}</code></p>
<h2 id="heading-gateway">Gateway</h2>
<p>A gateway acts as a thin abstraction layer over the different microservices within the cluster. The YARP gateway, which also functions as a reverse proxy, hides the individual services from the frontend and presents a unified API interface to the caller.</p>
<h1 id="heading-improvements">Improvements</h1>
<p>While CQRS Plus is nearly complete, it cannot yet be considered perfect. Currently, the operations involving saving to the event store and publishing to the event bus do not occur atomically. This process could be improved by implementing the outbox pattern. If I decide to implement it, I will update this article accordingly.</p>
<h1 id="heading-further">Further</h1>
<p>A future topic of interest could be the deployment strategies for this microservice application. While there are many options, two notable choices include:</p>
<ul>
<li><p>Kubernetes with managed instances like AKS (Azure) or EKS (AWS).</p>
</li>
<li><p>Elastic Container Service (ECS).</p>
</li>
</ul>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we explored a distributed implementation of the CQRS with Event Sourcing pattern. We delved into the intricacies of these patterns and their significance. We also examined the caveats associated with these approaches. While CQRS effectively addresses specific issues, it isn't necessary for all problems. We reviewed the implementation of a distributed CQRS with Event Sourcing and the rationale behind the technological choices.</p>
]]></content:encoded></item><item><title><![CDATA[Circular Buffers: The Smart Solution for Managing Data Streams]]></title><description><![CDATA[Introduction
In the fast-paced world of finance, technology teams often encounter situations where they must handle an extraordinarily high flow of data. For instance, when processing real-time stock market feeds, every millisecond counts. Keeping sy...]]></description><link>https://oxyprogrammer.com/circular-buffers-the-smart-solution-for-managing-data-streams</link><guid isPermaLink="true">https://oxyprogrammer.com/circular-buffers-the-smart-solution-for-managing-data-streams</guid><category><![CDATA[ Market Data feed]]></category><category><![CDATA[circular buffer]]></category><category><![CDATA[c sharp]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Sun, 03 Nov 2024 12:03:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730635290230/335fc539-13b4-48ea-bf7b-ca3f0649d747.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>In the fast-paced world of finance, technology teams often encounter situations where they must handle an extraordinarily high flow of data. For instance, when processing real-time stock market feeds, every millisecond counts. Keeping systems within the permissible bounds of memory footprint and latency is essential, making efficient data structures critical. One such structure is the circular buffer. This concept, while not new, is often misunderstood, as many resources fail to address real-world caveats that arise when implementing it. Circular buffers are ideal when you need to maintain the latest n records of data from a feed rather than historical data, ensuring that you only keep what’s most relevant.</p>
<h1 id="heading-what-is-a-circular-buffer">What is a Circular Buffer?</h1>
<p>A circular buffer is a fixed-size data structure that uses a single, contiguous block of memory to maintain a dynamically changing dataset. When new data arrives, it overwrites the oldest data once the buffer is full, thereby ensuring that the buffer consistently contains the most recent entries.</p>
<h1 id="heading-background">Background</h1>
<p>There are numerous implementations of circular buffers available, but I opted for a straightforward one that served my purpose elegantly. The requirement was to display market data on a C# GUI while the data feed provider delivered information at an extraordinarily high speed, like a dragon breathing fire.</p>
<p>Fortunately, the user was only concerned with the last 10 feeds, rather than the historical data. This is the perfect scenario for using a circular buffer, which excels in scenarios where only the latest entries are relevant. Among the various implementations available, the most suitable one for our needs was a circular buffer implementation that supported the overwriting feature.</p>
<h1 id="heading-real-world-applications">Real-world Applications</h1>
<p>Circular buffers are not limited to finance; they find applications in various fields:</p>
<ul>
<li><p>Networking: To handle incoming data packets without overwhelming the system.</p>
</li>
<li><p>Multimedia Streaming: For buffering audio and video streams to ensure smooth playback.</p>
</li>
<li><p>Sensor Data Management: In IoT devices, to keep track of the most recent sensor readings efficiently.</p>
</li>
</ul>
<h1 id="heading-implementation">Implementation</h1>
<p>Implementing a circular buffer is not overly complex and can be easily achieved using an array and two pointers, called head and tail. The head tracks the element that needs to be read, while the tail tracks the position where data will be added. Since it is a buffer, a fixed-size array is the best data structure for this purpose.</p>
<pre><code class="lang-plaintext">+---------------------+
|                     |
|    Circular Buffer  |
|                     |
+---------------------+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+
| x | x | x |   |   |   |   |   |   |  &lt;- Existing data
+---+---+---+---+---+---+---+---+---+
  ^(head)     ^(tail)

Where:
 - 'x' indicates filled slots (buffer elements).
 - an empty slot is indicated by a blank space.
 - 'head' indicates where data is read from.
 - 'tail' indicates where new data will be added/overwritten.
</code></pre>
<p>When the tail reaches the end of the array, it wraps around to the beginning, and the next item overwrites the first element, which is now the oldest. Since we care only about the latest n elements, overwriting the oldest data is acceptable.</p>
<pre><code class="lang-plaintext"> +---------------------+
 |                     |
 |    Circular Buffer  |
 |                     |
 +---------------------+
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
 +---+---+---+---+---+---+---+---+---+
 | y | x | x | x | x | x | x | x | x |  &lt;- After adding new data
 +---+---+---+---+---+---+---+---+---+
       ^(head, tail)

Where:
 - The pointer `tail` has wrapped around and overwritten the oldest data (index 0 now contains 'y').
 - Both `head` and `tail` now sit at index 1: the oldest element to read, and the next slot to overwrite as new data keeps arriving in a circular fashion.
</code></pre>
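<p>The head/tail scheme in the diagrams above can be sketched in a few lines. The buffer I actually used is in C# (see the repository linked below); this Python version is illustrative only:</p>

```python
# Sketch: a fixed-size circular buffer with overwrite-on-full,
# matching the head/tail scheme in the diagrams above.

class CircularBuffer:
    def __init__(self, capacity):
        self._data = [None] * capacity
        self._capacity = capacity
        self._head = 0   # oldest element (next read)
        self._tail = 0   # next write position
        self._count = 0

    def add(self, item):
        self._data[self._tail] = item
        self._tail = (self._tail + 1) % self._capacity
        if self._count == self._capacity:
            # Buffer full: the slot we just wrote held the oldest
            # element, so head advances past it.
            self._head = (self._head + 1) % self._capacity
        else:
            self._count += 1

    def latest(self):
        """Return the buffered items, oldest first."""
        return [self._data[(self._head + i) % self._capacity]
                for i in range(self._count)]

buf = CircularBuffer(5)
for tick in range(1, 9):   # push 8 feed updates into a 5-slot buffer
    buf.add(tick)
# Only the latest 5 ticks survive: [4, 5, 6, 7, 8]
```

<p>In a multi-threaded feed handler, <code>add</code> and <code>latest</code> would additionally need to be guarded by a lock, as discussed in the caveats below.</p>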
<h2 id="heading-caveats">Caveats</h2>
<p>One important consideration: instead of replacing the reference of the oldest element with a newly allocated object, we can reuse the existing object by overwriting its properties. You may argue that reading and writing properties takes time, but so do allocation and garbage collection of the discarded objects.</p>
<p>Additionally, some caveats to consider include:</p>
<ul>
<li><p>Buffer Size: Choosing an inappropriate buffer size can lead to data loss if the buffer fills up too quickly.</p>
</li>
<li><p>Concurrent Access: In multi-threaded situations, care must be taken to synchronize access to the buffer to prevent data corruption.</p>
</li>
</ul>
<p>Here is a screenshot of a comparative run conducted with 1 million objects in a buffer of size 5:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730635191661/641037c5-11c4-4799-b905-b4decda4070c.gif" alt class="image--center mx-auto" /></p>
<p>The circular buffer implementation and test code is available at the following location:<br /><a target="_blank" href="https://github.com/OxyProgrammer/circular-buffer">https://github.com/OxyProgrammer/circular-buffer</a></p>
<h1 id="heading-performance-metrics">Performance Metrics</h1>
<p>In my implementation, I found that using a circular buffer significantly improved performance in terms of memory usage and processing speed when compared to traditional array methods, which require shifting elements on every insertion.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Circular buffers are the default go-to data structure for scenarios with a heavy inflow of data when we only need to care about the last n items. They optimize memory usage and enhance processing efficiency, particularly useful in real-time applications. For smaller objects, overwriting properties is often more beneficial than overwriting the reference of an object, as it helps to avoid unnecessary garbage collection cycles.</p>
]]></content:encoded></item><item><title><![CDATA[Navigating Networking Layers: The Journey of Data Through TCP]]></title><description><![CDATA[Introduction
As backend engineers, we often take many things for granted without even realizing it, particularly what's happening beneath the surface. Although application programmers may not need to understand these underlying processes, architects ...]]></description><link>https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp</link><guid isPermaLink="true">https://oxyprogrammer.com/navigating-networking-layers-the-journey-of-data-through-tcp</guid><category><![CDATA[Application Load]]></category><category><![CDATA[networking]]></category><category><![CDATA[OSI Model]]></category><category><![CDATA[TCP]]></category><category><![CDATA[TCP/IP]]></category><category><![CDATA[Load Balancer]]></category><category><![CDATA[network load balancer]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Sat, 02 Nov 2024 15:30:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730559412025/8171c15f-d248-4003-89cf-b08ed7e23daa.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>As backend engineers, we often take many things for granted without even realizing it, particularly what's happening beneath the surface. Although application programmers may not need to understand these underlying processes, architects must grasp the internals. This understanding is crucial for several reasons:</p>
<ul>
<li><p>Choice of tools</p>
</li>
<li><p>Designing the data flow for the system</p>
</li>
<li><p>Estimating costs</p>
</li>
<li><p>Extensibility of the system</p>
</li>
</ul>
<p>Failing to properly assess any of these factors can lead to significant losses in time and cost, as well as challenging system maintenance.</p>
<p>When making an HTTP call, numerous processes occur behind the scenes. Your data is transformed from JSON to <em>bytes</em>, to <em>segments</em>, to <em>packets</em>, to <em>data frames</em>, and finally, to <em>radio waves</em> or <em>electrical signals</em>.</p>
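<p>That chain of transformations is, at its core, repeated encapsulation: each layer wraps the previous layer's output in its own header. The sketch below illustrates the idea in Python with heavily simplified placeholder headers, not real protocol formats:</p>

```python
# Sketch: the encapsulation chain from JSON down to frames.
# Header fields are simplified placeholders, not full protocol headers.

import json

def to_bytes(request_obj):                  # presentation layer
    return json.dumps(request_obj).encode("utf-8")

def to_segments(payload, dst_port, mss=8):  # transport layer (TCP)
    # Split the byte stream into segments of at most `mss` bytes,
    # each tagged with a sequence number and the destination port.
    return [{"seq": i, "port": dst_port, "data": payload[i:i + mss]}
            for i in range(0, len(payload), mss)]

def to_packets(segments, dst_ip):           # network layer (IP)
    return [{"dst_ip": dst_ip, "segment": s} for s in segments]

def to_frames(packets, next_hop_mac):       # data link layer
    return [{"dst_mac": next_hop_mac, "packet": p} for p in packets]

payload = to_bytes({"symbol": "ACME", "qty": 10})
frames = to_frames(to_packets(to_segments(payload, 443), "203.0.113.7"),
                   "aa:bb:cc:dd:ee:ff")
# The receiver unwraps in reverse: frame -> packet -> segment -> bytes -> JSON.
```

<p>Note how the port lives in the segment, the IP address in the packet, and the next hop's MAC address in the frame, exactly the layering the diagram below walks through.</p>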
<p>In this article, I aim to demystify the journey of an HTTP call to a remote server. As a bonus, we will also explore the differences between Application Load Balancers (ALB) and Network Load Balancers (NLB) in AWS. We will examine the cost differences between these AWS offerings with our newly acquired knowledge.</p>
<h1 id="heading-traveling-through-the-layers-of-network">Traveling through the layers of network</h1>
<p>Most of us are probably familiar with the OSI model of networking. However, it often remains an academic concept in our minds. Visualizing how data travels and transforms through the various network layers can be challenging. Revisiting the OSI model is a good starting point, but this time let's focus on visualizing how the data transforms.</p>
<p>The accompanying diagram explains how a request travels from the sender (typically the client) to the receiver (typically the server). It assumes that the receiver’s IP address and ports are known. If the IP address and port of the receiver are unknown, an additional DNS request is needed to resolve the IP address.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">You can easily see the result of the DNS lookup using <code>curl</code>. For example, the IP address and port for my domain can be readily located:</div>
</div>

<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe4YRYpnX8CYxr-dyzpxRz_5pk8BNsPKpdJ1SkhpoLqOSh1FTw-7lOQVbnwzmUTR7btp2oKXSnJm5ONxqMx4gMDqU1PN8lryKlxneX-VrsERgXhQ6lNsI4Zylzm7ZlJaZxh4tEbpHWuhSByX6Rf1wQaIW0?key=HyPyc-WXVAsAzSUUz0rLylvT" alt /></p>
<p>Before delving into the diagram, two points are worth noting:</p>
<ol>
<li><p>The layers are logical, not physical, and should be viewed as such.</p>
</li>
<li><p>The processes aren't strictly separated. Routers and switches may perform operations associated with other layers. For example, many routers (layer 3) also function as switches (layer 2). More on this later.</p>
</li>
</ol>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfAqpR66GKdMmUE--o0LckrxKg66tnBniW4TbyVMlYsUweSH2YN7eQQ43QDC9V-wbl0KdIoQjLzUyq7eREwAvTlLT8Yy0zPZUMSxR8G3WeXy0GJGo2E3BN4rgRC6CiKP3C54N8DkZyJwMD6BW0jvi1P1So?key=HyPyc-WXVAsAzSUUz0rLylvT" alt /></p>
<ul>
<li><p>The application code resides at the application layer, where a fetch call is made to a Node.js API or another technology. The fetch request includes the IP, port, and request object, which is a JSON object.</p>
</li>
<li><p>At the presentation layer, the request object is transformed into bytes.</p>
</li>
<li><p>The bytes and host port information are sent to another API call in the session layer, which maintains the state of the connection as it forms in the following layer.</p>
</li>
<li><p>Now the flow enters the OS level. Each operating system implements its handling from the transport layer onward, where a TCP connection is established with the receiver. TCP data can be exchanged only over a valid connection, meaning both the sender and receiver operating systems maintain state to ensure reliable data transmission over TCP. Data is broken into <strong>SEGMENTS</strong>, each stamped with a header containing the flags and fields necessary for TCP communication and data integrity. Understanding the TCP header in detail is a topic for another article.</p>
</li>
<li><p>In the network layer, segments are turned into IP <strong>PACKETS</strong>, containing the receiver’s IP address. The port information remains within the segments inside the IP packets. These packets are then passed to another API call that takes them down to the data link layer.</p>
</li>
<li><p>The data link layer packages IP packets into data <strong>FRAMES</strong>, which include the MAC address the frame should be forwarded to. In this case, that is the sender router’s MAC address.</p>
</li>
<li><p>The physical layer transmits data frames as radio waves (Wi-Fi) or light signals (optical fiber) to the sender router.</p>
</li>
<li><p>The sender router inspects the data frame and the underlying IP packet, recognizing that the recipient IP belongs to a distant server. It performs Network Address Translation (NAT), replacing the sender’s private IP and port with its own public IP and a port. To perform NAT, routers need access to layer 4 data (ports), so in practice they operate across layers 2 through 4.</p>
</li>
<li><p>After several hops across ISPs and routers, the frame reaches the destination network and its gateway, the receiver router.</p>
</li>
<li><p>The receiver router checks its NAT table to determine the internal host and port for the request, and consults the ARP table for the server's MAC address.</p>
</li>
<li><p>The router forwards the request to the correct receiver, which extracts the IP packets from the data frames and the segments from the IP Packets, then handles the TCP segments using the port specified to deliver data to the appropriate program.</p>
</li>
<li><p>The response from the receiver travels back via a similar route. Hence, I use 'sender' and 'receiver' instead of 'client' and 'server.'</p>
</li>
</ul>
<p>You may wonder why terms like segments, packets, and frames are emphasized. This is to highlight:</p>
<ul>
<li><p>Segments for TCP</p>
</li>
<li><p>Packets for IP</p>
</li>
<li><p>Frames for Data</p>
</li>
</ul>
<p>These elements fit together like <a target="_blank" href="https://en.wikipedia.org/wiki/Matryoshka_doll">Matryoshka dolls</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730723213828/2a018f6f-4962-41ee-96f1-f4f3a2616696.png" alt class="image--center mx-auto" /></p>
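<p>The nesting can be sketched with toy Go types. The field sets below are heavily simplified assumptions for illustration; real TCP, IP, and Ethernet headers carry many more fields:</p>

```go
package main

import "fmt"

// Toy, heavily simplified views of each wrapper. Real headers carry
// many more fields (flags, checksums, TTL, and so on).
type Segment struct { // layer 4: TCP
	SrcPort, DstPort uint16
	Payload          []byte
}

type Packet struct { // layer 3: IP
	SrcIP, DstIP string
	Segment      Segment
}

type Frame struct { // layer 2: data link
	SrcMAC, DstMAC string
	Packet         Packet
}

func main() {
	seg := Segment{SrcPort: 51234, DstPort: 443, Payload: []byte("GET /")}
	pkt := Packet{SrcIP: "192.168.0.5", DstIP: "93.184.216.34", Segment: seg}
	frm := Frame{SrcMAC: "aa:bb:cc:dd:ee:ff", DstMAC: "11:22:33:44:55:66", Packet: pkt}
	// The frame wraps the packet, which wraps the segment, like Matryoshka dolls.
	fmt.Println(frm.Packet.Segment.DstPort) // the innermost doll is still reachable
}
```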
<p>Questions like what NAT tables and ARP tables are, and how NAT works, will be explored in another article. For the OSI model, the key takeaway is how data transforms as it crosses each layer; the finer nuances of network data travel will be addressed later.</p>
<h1 id="heading-load-balancers-nlb-vs-alb">Load balancers: NLB vs ALB</h1>
<p>Load balancers function as reverse proxies with the added capability of distributing traffic efficiently across instances of services. Having already explored the various layers of the OSI model and how data transforms as it travels through them, we are well-equipped to understand the differences between Network Load Balancers and Application Load Balancers. This comprehension will also help us grasp why their costs differ among AWS offerings.</p>
<h2 id="heading-network-load-balancers-nlb">Network Load Balancers (NLB)</h2>
<p>NLBs operate at Layer 4 of the OSI model, which means they primarily manage connections and segments. As illustrated in the accompanying diagram, when clients establish TCP connections with the load balancer, the segments are transmitted without the balancer being aware of the entire request that these segments make up. Consequently, the load balancing operation is inherently sticky: all segments from a given connection are directed to one specific server. For example, all blue segments are paired with one server, while all orange segments correspond to another (see the diagram below). Additionally, since NLBs lack visibility into higher-layer data, they do not support caching or advanced routing functionalities, making them purely focused on load balancing while ensuring security by not exposing higher-layer information.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730723233039/0d61d9f3-7b87-4e66-8c35-1510b8fe93a1.png" alt /></p>
<ul>
<li><p>Pros:</p>
<ol>
<li><p>Simple and efficient for distributing traffic based on TCP connections.</p>
</li>
<li><p>High performance with low latency.</p>
</li>
<li><p>Secure, as it doesn’t handle higher-layer payloads.</p>
</li>
</ol>
</li>
<li><p>Cons:</p>
<ol>
<li><p>Limited to Layer 4; lacks visibility into application-level data.</p>
</li>
<li><p>No caching support or advanced routing capabilities.</p>
</li>
<li><p>Sticky sessions by default, which may not suit all applications.</p>
</li>
</ol>
</li>
</ul>
<h2 id="heading-application-load-balancers-alb">Application Load Balancers (ALB)</h2>
<p>ALBs, in contrast, function at Layer 7 of the OSI model, allowing them to analyze the entire incoming request. As shown in the figure, ALBs maintain a buffer and act as independent receivers of segments, which enables them to implement smart routing capabilities based on specific request route keys. This capability enhances their efficiency in directing traffic to appropriate service clusters. Moreover, ALBs can also perform TLS offloading, necessitating the management of TLS certificates, which some may view as a potential security risk. Unlike NLBs, ALBs can provide caching support, further optimizing the handling of incoming requests.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730723239876/8a91ef39-d272-45cf-bb30-346fcfcdf866.png" alt /></p>
<ul>
<li><p>Pros:</p>
<ol>
<li><p>Operates at Layer 7, enabling detailed inspection of requests for smart routing.</p>
</li>
<li><p>Provides caching support, enhancing response times.</p>
</li>
<li><p>Capable of performing TLS offloading, improving security management.</p>
</li>
</ol>
</li>
<li><p>Cons:</p>
<ol>
<li><p>More complex and may introduce additional latency due to extensive processing.</p>
</li>
<li><p>Requires management of TLS certificates, which can pose a security risk if not handled properly.</p>
</li>
<li><p>Potentially higher costs compared to NLBs due to more advanced features.</p>
</li>
</ol>
</li>
</ul>
<h1 id="heading-os-facilitating-tcp-connection-under-the-hood">OS facilitating TCP connection under the hood</h1>
<p>A TCP connection is established through a three-way handshake that consists of SYN, SYN-ACK, and ACK requests. The SYN represents the synchronization of the client’s initial sequence number for transmission.</p>
<p>The diagram below illustrates the connection establishment process. This is a well-known fact, and you are likely already familiar with it.</p>
<pre><code class="lang-plaintext">Client                               Server
   |                                       |
   | ---------------- SYN ---------------&gt; |
   |                                       |
   | &lt;-------------- SYN-ACK ------------- |
   |                                       |
   | ---------------- ACK ---------------&gt; |
   |                                       |
   | &lt;-------- Connection Established ---- |
   |                                       |
</code></pre>
<p>The interesting part lies in understanding what happens under the hood in the server’s operating system while these SYN and ACK requests are exchanged. After several attempts to write this explanation, I decided to create a comic strip to illustrate the process. I believe that following the comic strip while reading along will enhance comprehension. Additionally, we will correlate this process with high-level code for easier familiarity with our everyday coding experience.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfDZDdC7_NcLKVg9v4nyk4MXzQ77aBwbH481szKGEzLc3mEunYGNkT3WSYvTRN9TizDG9qRJZWRYLLeimFPCbmgrUvc2njJs6nAuK5Vf78sVhO6kvpKfc8W8mjjIEgprG5wdrSay-lEruevS2UAlQyjKN0d?key=HyPyc-WXVAsAzSUUz0rLylvT" alt /></p>
<p>Let's first discuss the server processes starting up in the OS. It’s quite common for a single server to run multiple server programs listening on different ports. For instance, two processes—Process-1 and Process-2—could start on ports 3000 and 8000, respectively. The Node.js code to initiate these server programs would look like this:</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// Import the required modules from Node.js</span>
<span class="hljs-keyword">const</span> http = <span class="hljs-built_in">require</span>(<span class="hljs-string">'http'</span>);

<span class="hljs-comment">// Create the first server instance for handling HTTP requests on port 3000</span>
<span class="hljs-keyword">const</span> process1Server = http.createServer(<span class="hljs-function">(<span class="hljs-params">req, res</span>) =&gt;</span> {
    <span class="hljs-comment">// Process incoming requests for this server</span>
    res.end(<span class="hljs-string">'Hello from Process 1 on port 3000!'</span>);
});

<span class="hljs-comment">// Creating the second server instance for handling HTTP requests on port 8000</span>
<span class="hljs-keyword">const</span> process2Server = http.createServer(<span class="hljs-function">(<span class="hljs-params">req, res</span>) =&gt;</span> {
    <span class="hljs-comment">// Process incoming requests for this server</span>
    res.end(<span class="hljs-string">'Hello from Process 2 on port 8000!'</span>);
});

<span class="hljs-comment">// Making the first server listen on port 3000</span>
process1Server.listen(<span class="hljs-number">3000</span>, <span class="hljs-string">"192.168.0.5"</span>, <span class="hljs-function">() =&gt;</span> {
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Server is listening on port 3000'</span>);
});

<span class="hljs-comment">// Making the second server listen on port 8000</span>
process2Server.listen(<span class="hljs-number">8000</span>, <span class="hljs-string">"192.168.0.5"</span>, <span class="hljs-function">() =&gt;</span> {
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Server is listening on port 8000'</span>);
});
</code></pre>
<ol>
<li><p>With the call to the <code>listen()</code> method, Node.js makes internal system calls to create sockets bound to the addresses 192.168.0.5:3000 and 192.168.0.5:8000, respectively. For each socket, file descriptors are created. Sockets can be thought of as files, and file descriptors serve as identifiers for those files.</p>
</li>
<li><p>The OS maintains two queues: the SYN queue (also known as the half-connection queue) and the accept queue. When a SYN request arrives from a client, the OS immediately sends back a SYN-ACK and creates an entry in the SYN queue containing the following details:</p>
<ul>
<li><p>Sender Address</p>
</li>
<li><p>Sender Port</p>
</li>
<li><p>Receiver Address (in this case, 192.168.0.5)</p>
</li>
<li><p>Receiver Port (in this case, 3000)</p>
</li>
<li><p>Client’s Initial Sequence Number</p>
</li>
</ul>
</li>
<li><p>The client then sends an ACK, at which point the entry in the SYN queue is moved to the accept queue.</p>
</li>
<li><p>The accept queue is a data structure that server processes monitor closely, often utilizing a mutex. They repeatedly call <code>accept()</code> on its entries.</p>
<p> In this example, Process-2, which has the file descriptor for the socket created for 192.168.0.5:8000, attempts to call accept() on the entry. However, the OS checks and denies the request because the entry corresponds to 192.168.0.5:3000.</p>
</li>
<li><p>Next, Process-1 calls accept() successfully, as its file descriptor matches the entry in the queue. At this point, the OS creates a new socket and file descriptor for this connection using the data from the accept queue entry:</p>
<ul>
<li><p>Sender Address</p>
</li>
<li><p>Sender Port</p>
</li>
<li><p>Receiver Address (in this case, 192.168.0.5)</p>
</li>
<li><p>Receiver Port (in this case, 3000).</p>
</li>
</ul>
</li>
</ol>
<p>    The OS then hands over this socket to Process-1, which now has all the necessary details for communication between the client and the server.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>This article has provided an in-depth look at the layers of networking and the process of creating a TCP connection. If you've learned something new from this discussion, you likely recognize the high cost of network calls and may already be considering ways to minimize them. Truly understanding the intricacies of network calls often leads to reservations about adopting a Microservices Architecture, as it inherently involves an increase in network calls, which can introduce complexity and latency.</p>
<p>While this article explores various concepts in detail, it inevitably leaves many questions unanswered. For instance, although we examined the journey of data from sender to receiver, we only touched on the ideas of ARP and NAT without fully elaborating on them. Additionally, while we discussed TCP connection establishment, TCP encompasses much more than just this aspect. How does it guarantee data integrity? How does it regulate the flow of data between sender and receiver? After all, TCP has been a foundational protocol for nearly 40 years and continues to underpin many of the major network communications we rely on today.</p>
<p>In future articles, we will work to clarify these remaining details and further demystify the complexities of networking.</p>
]]></content:encoded></item><item><title><![CDATA[Exploring Memory Management: A .NET Developer's Insights into Golang]]></title><description><![CDATA[Introduction
Over a decade ago, as I explored the intricacies of .NET memory allocation, it struck me as intuitive and elegant. The CLR efficiently manages the Managed Heap as a contiguous slice of memory, optimizing garbage collection—a brilliant co...]]></description><link>https://oxyprogrammer.com/exploring-memory-management-a-net-developers-insights-into-golang</link><guid isPermaLink="true">https://oxyprogrammer.com/exploring-memory-management-a-net-developers-insights-into-golang</guid><category><![CDATA[escape analysis]]></category><category><![CDATA[memory-management]]></category><category><![CDATA[golang]]></category><category><![CDATA[.NET]]></category><category><![CDATA[C#]]></category><category><![CDATA[stack]]></category><category><![CDATA[heap]]></category><category><![CDATA[garbagecollection]]></category><dc:creator><![CDATA[Siddhartha S]]></dc:creator><pubDate>Sat, 26 Oct 2024 18:30:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729961643490/88d91bfe-7be0-4830-adf4-89821aa7fae6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Over a decade ago, as I explored the intricacies of .NET memory allocation, it struck me as intuitive and elegant. The CLR efficiently manages the Managed Heap as a contiguous slice of memory, optimizing garbage collection—a brilliant concept that endures. Fast forward to today, as I explore Golang, I find my understanding of "elegance" expanding.</p>
<p>Memory management in Golang differs significantly from .NET, and for good reasons. Go doesn't adhere to the OOP paradigm like C#, necessitating a unique approach to memory management and garbage collection. While the differences between Go and .NET are numerous and worthy of separate discussion, this article focuses on memory allocation and garbage collection in these two robust technologies.</p>
<h1 id="heading-background-memory-collection-in-net">Background: Memory Collection in .Net</h1>
<p>If you're reading this, you likely have some familiarity with .NET memory allocation and garbage collection. Let's briefly recap key features to set the stage for comparison:</p>
<ul>
<li><p>Functions operate on the stack, where primitive types are declared.</p>
</li>
<li><p>Reference types instantiated during function execution are allocated on the managed heap.</p>
</li>
<li><p>Primitive types encapsulated by reference types are also allocated within the heap.</p>
</li>
<li><p>When a function completes execution and no variables reference the heap location, the memory is considered orphaned and eventually freed by the Garbage Collector.</p>
</li>
</ul>
<p>This is a simplified overview of .NET Garbage Collection, omitting complexities like the Large Object Heap (LOH), yet it captures the essence of the process.</p>
<h1 id="heading-the-plight">The Plight</h1>
<p>Go's memory allocation is more complex. Unlike C#, where all classes are reference types and hence go to the Managed Heap, Go employs pointers. It may seem that taking a pointer to a struct in Go transforms that value type into a reference type, behaving just like .NET garbage collection.</p>
<p>Well, let's find out with an example:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729961521961/ec0be39a-a217-45ed-a2a3-ef71f8a6f4c8.png" alt class="image--center mx-auto" /></p>
<p>To summarize the above code:</p>
<ul>
<li><p>Line 17: The main method begins program execution.</p>
</li>
<li><p>Line 19: Declares the struct <code>SmallStruct</code>.</p>
</li>
<li><p>Line 20: Invokes <code>changeSmallStruct(&amp;smallStruct)</code>.</p>
</li>
<li><p>Line 21: Prints and verifies changes.</p>
</li>
<li><p>Line 23: Calls <code>newSmallStruct()</code>.</p>
</li>
<li><p>Line 24: Prints and confirms initialization.</p>
</li>
</ul>
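<p>Since the code above appears only as a screenshot, here is a close textual reconstruction assembled from the walkthrough (struct fields, function names, and values are taken from the diagrams below; exact line numbers and formatting may differ from the original):</p>

```go
package main

import "fmt"

type SmallStruct struct {
	Name string
	City string
}

// newSmallStruct returns a pointer to a struct created inside this
// function's frame; escape analysis moves it to the heap.
func newSmallStruct() *SmallStruct {
	return &SmallStruct{Name: "James", City: "Panji"}
}

// changeSmallStruct mutates the caller's struct through the pointer.
func changeSmallStruct(arg *SmallStruct) {
	arg.Name = "Siddhartha"
	arg.City = "Bangalore"
}

func main() {
	smallStruct := &SmallStruct{Name: "Sid", City: "Mumbai"} // stays on the stack
	changeSmallStruct(smallStruct)
	fmt.Println(smallStruct) // prints &{Siddhartha Bangalore}

	anotherSmallStruct := newSmallStruct() // points to heap memory
	fmt.Println(anotherSmallStruct)        // prints &{James Panji}
}
```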
<p>Here we discuss just the plight, and so I will just leave you with a question and its answer:</p>
<p><strong>Question</strong>: Which instance of the two (<code>smallStruct</code>, <code>anotherSmallStruct</code>) will get allocated to where (stack, heap)? Remember both are pointers and hence references.</p>
<p><strong>Answer</strong>: <code>smallStruct</code> → Stack, <code>anotherSmallStruct</code>→ Heap</p>
<h1 id="heading-demystifying-the-plight">Demystifying the plight</h1>
<p>To understand what's happening, we will need to have a look at how things work in the stack as the program proceeds.</p>
<p>The ASCII art diagram below illustrates the stack operations; a step-by-step explanation follows:</p>
<pre><code class="lang-plaintext">[Line 17: Defines the main method and that's where the program execution will start from.]
Stack:
|----------------|
| main()         |
|----------------|

[Line 19: Declares the struct called SmallStruct.]
Stack:
|----------------|
| main()         |
| smallStruct    | &lt;- Allocated on stack (Name: "Sid", City: "Mumbai")
|----------------|

[Line 20: Calling changeSmallStruct(&amp;smallStruct). This is when a new stack frame is loaded.]
Stack:
|----------------|
| changeSmallStruct() |
| arg (pointer)  | &lt;- Points to smallStruct in main()'s frame
|----------------|
| main()         |
| smallStruct    |
|----------------|

[After changeSmallStruct() completes]
Stack:
|----------------|
| main()         |
| smallStruct    | &lt;- Modified (Name: "Siddhartha", City: "Bangalore")
|----------------|

[Line 23: Calling newSmallStruct()]
Stack:
|----------------|
| newSmallStruct() |
|----------------|
| main()         |
| smallStruct    |
|----------------|
Heap:
[SmallStruct]    &lt;- Allocated on heap (Name: "James", City: "Panji")

[After newSmallStruct() returns]
Stack:
|----------------|
| main()         |
| smallStruct    |
| anotherSmallStruct | &lt;- Pointer to heap-allocated SmallStruct
|----------------|
Heap:
[SmallStruct]    &lt;- (Name: "James", City: "Panji")

[Program End]
Stack:
(empty)
Heap:
[SmallStruct]    &lt;- Will be garbage collected
</code></pre>
<p>The explanation of the ASCII art diagram is as follows (you may want to read the explanation and the diagram side by side):</p>
<ol>
<li>In line 19, the memory for the struct <code>SmallStruct</code> gets allocated on the stack and a pointer to it is created. The pointer is assigned to the variable <code>smallStruct</code>.</li>
</ol>
<ol start="2">
<li><p>In line 20, the method <code>changeSmallStruct</code> gets called and a new stackFrame is loaded for that.</p>
</li>
<li><p>In lines 12-15: The pointer gets passed to the method changeSmallStruct and this method works on the pointer by dereferencing it and then changing the fields that it is pointing to. Thus the memory in the stack frame of the main() method that the pointer is referencing to, gets updated (Field names get changed to Siddhartha and Bangalore respectively).</p>
</li>
<li><p>The stack frame for the method <code>changeSmallStruct()</code> is unloaded and the stack memory is reclaimed.</p>
</li>
<li><p>In line 23: New Stack frame gets loaded for <code>newSmallStruct()</code>.</p>
</li>
<li><p>In line 9: The same thing happens that happened in line 19. <em>The memory for the struct</em> <code>SmallStruct</code> <em>gets assigned in the stack frame of</em> <code>newSmallStruct()</code> <em>method.</em> And this pointer is returned to the stack frame of the main() method in line 23.</p>
 <div data-node-type="callout">
 <div data-node-type="callout-emoji">💡</div>
 <div data-node-type="callout-text">The italicized line in the above point is not correct, but please read on!</div>
 </div>
</li>
<li><p>The next logical step would be to unload the stack frame for <code>newSmallStruct()</code>. But here comes the paradox: the memory allocated for <code>SmallStruct</code> lives in the very stack frame that is about to be unloaded, while the main method's frame still holds a pointer to it. That is why, at line 9, the Go compiler does not allocate the struct in the stack frame of <code>newSmallStruct()</code>. It allocates it on the heap instead, because the frame it would otherwise live in must be unloaded while a pointer from the main method still references it.</p>
</li>
<li><p>The main program finally exits, and the heap memory for <code>SmallStruct {Name: “James”, City: “Panji”}</code> will be garbage collected.</p>
</li>
</ol>
<p>Phew, so certainly the memory allocation and garbage collection is done quite differently in Go!</p>
<h1 id="heading-further">Further</h1>
<p>At this point there may be several questions popping in your head:</p>
<h3 id="heading-is-this-it-is-it-the-only-way-golang-decides-to-allocate-memory">Is this it, is it the only way Golang decides to allocate memory?</h3>
<p>The Go compiler uses a technique called <strong>escape analysis</strong> to determine whether a variable should be allocated on the heap or the stack. Generally, the Go compiler tries to allocate variables on the stack when possible, as it's more efficient.</p>
<p>Variables are <em>typically</em> allocated on the heap when:</p>
<ul>
<li><p>They are too large for the stack</p>
</li>
<li><p>Their size is not known at compile time</p>
</li>
<li><p>They are shared with other goroutines</p>
</li>
<li><p>They outlive the function that created them.</p>
</li>
</ul>
<p>You may refer to the Go documentation. Here are a few links:</p>
<ul>
<li><p><a target="_blank" href="https://go.dev/doc/faq#stack_or_heap">https://go.dev/doc/faq#stack_or_heap</a></p>
</li>
<li><p><a target="_blank" href="https://go.dev/doc/go1.13#runtime">https://go.dev/doc/go1.13#runtime</a></p>
</li>
<li><p><a target="_blank" href="https://go.dev/ref/mem">https://go.dev/ref/mem</a></p>
</li>
</ul>
<h3 id="heading-why-do-we-care-where-the-memory-allocation-happens">Why do we care where the memory allocation happens?</h3>
<p>Most of the time we don’t, and we shouldn't. But too much garbage collection can become a bottleneck for an application’s performance, and being aware of how memory allocation works lets you write better code that aligns with the Go mindset. As an example, I would like to draw your attention to a Go standard library method:</p>
<p><a target="_blank" href="https://pkg.go.dev/io#Reader">https://pkg.go.dev/io#Reader</a></p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> Reader <span class="hljs-keyword">interface</span> {
    Read(p []<span class="hljs-keyword">byte</span>) (n <span class="hljs-keyword">int</span>, err error)
}
</code></pre>
<p>Notice that the <code>Read</code> method takes a byte slice rather than returning one. A caller-provided buffer can be reused across calls, so a fresh slice need not be allocated (and escape to the heap) on every read; this is exactly the allocation concern discussed above.</p>
<h3 id="heading-how-do-we-always-know-where-the-memory-is-getting-allocated">How do we always know where the memory is getting allocated?</h3>
<p>Actually, even seasoned Golang devs can be tricked when predicting memory allocation. So it's better to use build tooling to find out where the objects created in a program are allocated. One such command is the following:</p>
<blockquote>
<p><em>go build -gcflags -m main.go</em>   </p>
</blockquote>
<p>For the above demo program it would output the following:</p>
<blockquote>
<p>./main.go:8:6: can inline newSmallStruct</p>
<p>./main.go:12:6: can inline changeSmallStruct</p>
<p>./main.go:17:6: can inline main</p>
<p>./main.go:20:19: inlining call to changeSmallStruct</p>
<p>./main.go:23:38: inlining call to newSmallStruct</p>
<p>./main.go:9:9: &amp;SmallStruct{...} escapes to heap</p>
<p>./main.go:12:24: arg does not escape</p>
<p>./main.go:19:17: &amp;SmallStruct{...} does not escape</p>
<p>./main.go:23:38: &amp;SmallStruct{...} does not escape</p>
</blockquote>
<p>You may add <code>-m=2</code> to increase verbosity.  </p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Memory management is fundamental to a programming language's efficiency and performance. Transitioning from languages like C# or Java, one might expect similar garbage collection behavior in Go, yet the approaches differ significantly. Developing proficiency in a language involves aligning with its design principles. That can only be achieved by being informed about internals.</p>
]]></content:encoded></item></channel></rss>