yfractal (yang)

Recurrent Neural Network (RNN) Introduction

yfractal — Fri, 26 Dec 2025 09:18:14 +0800

Original post: https://github.com/yfractal/blog/blob/master/blog/2025-12-23-rnn-introduction.md

Code: https://github.com/yfractal/rnn-rb

How SDB Scans the Ruby Stack Without the GVL

yfractal — Wed, 15 Jan 2025 23:03:05 +0800

Link: https://github.com/yfractal/blog/blob/master/blog/2025-01-15-non-blocking-stack-profiler.md

This article explains why SDB can scan the Ruby stack without holding the global lock (GVL) and still obtain the results it needs.

A Simple Stack Profiler

yfractal — Wed, 08 Jan 2025 08:23:32 +0800

https://github.com/yfractal/sdb_signal

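This is a very simple stack profiler, written mainly as a baseline for performance comparisons with SDB.

Because it is simple, it is a good way to learn how a Ruby stack profiler is implemented and how to write a Ruby extension in Rust.

A typical stack profiler first installs a signal handler and then triggers the signal at a fixed interval, for example every 1 ms. Inside the signal handler, it scans the thread's current stack with Ruby's built-in rb_profile_thread_frames. Finally, it calls functions such as rb_profile_frame_full_label to obtain the symbols.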
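The same signal-plus-timer idea can be sketched in Python. This is only an analogy, not sdb_signal's code: it assumes a Unix system, must run in the main thread, and samples Python frames rather than Ruby frames. The names on_sample, profile and busy are made up for the example.

```python
import collections
import signal
import time

samples = collections.Counter()


def on_sample(signum, frame):
    # Walk the interrupted stack from the innermost frame outwards.
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    samples[";".join(reversed(names))] += 1


def profile(fn, *args, interval=0.001):
    """Run fn(*args) while sampling the main thread's stack every `interval` seconds."""
    samples.clear()
    signal.signal(signal.SIGALRM, on_sample)                   # 1. install the handler
    signal.setitimer(signal.ITIMER_REAL, interval, interval)   # 2. fire it periodically
    try:
        fn(*args)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0, 0)             # stop the timer
    return dict(samples)


def busy(seconds):
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass


result = profile(busy, 0.05)
```

Each key in result is a semicolon-joined call stack and each value is the number of samples that landed in it, which is the raw material for a flame graph.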

Usage:

require 'sdb_signal'

def foo(n)
  if n == 0
    sleep 10000000
  else
    foo(n - 1)
  end
end


threads = []
5.times do |i|
  threads << Thread.new do
    foo(10)
  end
end

SdbSignal.setup_signal_handler
SdbSignal.start_scheduler(threads)

# There is no output here. To see output, find the corresponding Rust code and add a print or log statement.

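SDB is somewhat more complex because it aims to be non-blocking. It does not take the global lock, so it cannot use Ruby functions such as rb_profile_thread_frames and has to implement its own scanning logic. It also has a few architectural optimizations, plus some concurrency considerations (because it has to be non-blocking), such as using spinlocks and memory barriers for concurrency control.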

Understanding the Page Table Step by Step

yfractal — Wed, 01 Jan 2025 22:18:30 +0800

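I never understood page tables well. While rereading xv6, a simple Unix-like teaching operating system, I realized that a page table is just a special hash map: part of the virtual address serves as the key (index) into the page table, and the last few bits serve as the offset within the page (which guarantees that memory within a page is contiguous and saves memory). Multi-level page tables exist so that memory for the tables can be allocated lazily, which also saves memory.

I wrote an article as a record: https://github.com/yfractal/blog/blob/master/blog/2025-01-01.md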
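To make the "hash map" view concrete, here is a toy two-level page table in Python. It is a sketch under assumed parameters (4 KiB pages, 9 index bits per level, only two levels), not xv6's actual layout, and the class and function names are invented for the example.

```python
PAGE_SHIFT = 12                      # low 12 bits: offset inside a 4 KiB page
INDEX_BITS = 9                       # assumed: 9 bits of index per level
INDEX_MASK = (1 << INDEX_BITS) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def split(vaddr):
    """Split a virtual address into (top index, leaf index, offset)."""
    offset = vaddr & OFFSET_MASK
    leaf_index = (vaddr >> PAGE_SHIFT) & INDEX_MASK
    top_index = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    return top_index, leaf_index, offset


class TwoLevelPageTable:
    def __init__(self):
        self.top = [None] * (1 << INDEX_BITS)
        self.allocated_leaf_tables = 0

    def map_page(self, vaddr, frame):
        top_index, leaf_index, _ = split(vaddr)
        if self.top[top_index] is None:
            # Lazy allocation: a leaf table exists only once something maps into it.
            self.top[top_index] = [None] * (1 << INDEX_BITS)
            self.allocated_leaf_tables += 1
        self.top[top_index][leaf_index] = frame

    def translate(self, vaddr):
        top_index, leaf_index, offset = split(vaddr)
        leaf = self.top[top_index]
        if leaf is None or leaf[leaf_index] is None:
            raise LookupError("page fault at %#x" % vaddr)
        # The offset is copied unchanged, so bytes within a page stay contiguous.
        return (leaf[leaf_index] << PAGE_SHIFT) | offset


table = TwoLevelPageTable()
table.map_page(0x201000, frame=7)
physical = table.translate(0x201ABC)
```

With a single level, all 2^18 entries would have to exist up front. Here only one leaf table of 512 entries is allocated after the first mapping.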

SDB generated RubyChina(Homeland) call graph

yfractal — Fri, 18 Oct 2024 23:34:11 +0800

The beginning:

....

The bottom:

An overview:

I can't upload the original image because it's too large.

The image was generated by https://github.com/yfractal/sdb, a Ruby stack profiling tool that is still at an experimental stage.

Observing Puma Thread Scheduling through eBPF

yfractal — Tue, 08 Oct 2024 09:04:04 +0800

Introduction

Observation tools can help us understand and improve system performance. In this article, I will introduce how to observe Puma thread scheduling through eBPF.

The eBPF code used in this example can be found here: https://github.com/yfractal/sdb/blob/main/scripts/thread_schedule.py

How It Works

eBPF[1] allows us to probe kernel functions. We can use the command sudo bpftrace -l | grep -E "kprobe|kfunc" to find all available kernel functions.

Inspired by BCC's offcputime.py, I use finish_task_switch as the instrumentation point. This function is called after the context has switched to the new task (thread)[2]. The previous task (thread) is available through the prev argument, and the current task is the one that has just been switched to. Its signature is:

static struct rq *finish_task_switch(struct task_struct *prev) __releases(rq->lock);

The program is straightforward: it records a start timestamp for the current thread. When the system later suspends that thread, the thread appears as the prev argument. At that point, we record the end timestamp and submit the event.

int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
    u32 pid, tgid;
    u64 ts = bpf_ktime_get_ns();

    // current task
    u64 pid_tgid = bpf_get_current_pid_tgid();
    tgid = pid_tgid >> 32;
    pid = (__u32)pid_tgid;

    struct event_t event = {};
    event.pid = pid;
    event.tgid = tgid;
    bpf_get_current_comm(&event.name, sizeof(event.name));
    event.start_ts = ts;
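    // Key by thread ID (pid): all threads of one process share the same tgid,
    // so keying by tgid would let one thread overwrite another's entry.
    events_map.update(&pid, &event);

    // previous task
    pid = prev->pid;
    tgid = prev->tgid;

    struct event_t *eventp = events_map.lookup(&pid);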
    if (eventp == 0) {
        bpf_trace_printk("prev is nil");
        return 0;
    }
    eventp->end_ts = ts;
    events.perf_submit(ctx, eventp, sizeof(*eventp));

    return 0;
}
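The pairing logic is easier to check outside the kernel. The sketch below replays a list of context switches in Python and produces on-CPU intervals. It is a simulation, not the BCC program: the tuple layout (timestamp_ns, prev_tid, next_tid) and the function name are assumptions made for the example.

```python
def on_cpu_intervals(switches):
    """Turn context switches into (tid, start_ns, end_ns) on-CPU intervals.

    `switches` is an iterable of (timestamp_ns, prev_tid, next_tid), one per
    simulated finish_task_switch call.
    """
    started = {}      # tid -> timestamp at which the thread was switched in
    intervals = []
    for ts, prev_tid, next_tid in switches:
        # The thread that was just switched in starts running now.
        started[next_tid] = ts
        # The thread that was switched out stops running now.
        start = started.pop(prev_tid, None)
        if start is not None:
            intervals.append((prev_tid, start, ts))
    return intervals


intervals = on_cpu_intervals([
    (100, 0, 1),   # thread 1 starts running
    (250, 1, 2),   # thread 1 is suspended, thread 2 starts
    (400, 2, 1),   # thread 2 is suspended, thread 1 starts again
])
```

Because the map is keyed by thread ID, two threads of the same process never overwrite each other's start timestamp.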

The Results

Next, I created a simple HTTP server using Roda and used a Ruby script to send HTTP requests. After collecting the events, I converted them into the Perfetto trace format.
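The exact conversion isn't shown here. One option, sketched below, is the Chrome JSON trace event format, which the Perfetto UI can also open. The event fields mirror event_t from the eBPF program; the function name and the sample values (including tgid 23600) are hypothetical.

```python
import json


def to_chrome_trace(events):
    """Convert scheduling events (timestamps in ns) to Chrome trace 'complete' events."""
    trace_events = []
    for event in events:
        trace_events.append({
            "name": event["name"],
            "ph": "X",                                   # complete event: start + duration
            "ts": event["start_ts"] / 1000,              # the format expects microseconds
            "dur": (event["end_ts"] - event["start_ts"]) / 1000,
            "pid": event["tgid"],                        # process ID
            "tid": event["pid"],                         # kernel "pid" is the thread ID
        })
    return {"traceEvents": trace_events}


sample = [{"name": "puma srv", "pid": 23630, "tgid": 23600,
           "start_ts": 1_000_000, "end_ts": 3_500_000}]
trace = to_chrome_trace(sample)
trace_json = json.dumps(trace)
```

Writing trace_json to a file gives something that can be dragged into the Perfetto UI, with one track per thread.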

The result looks like this (the trace is available here):

The puma srv 23630 thread (the last thread in the image) is Puma’s server thread, which pulls ready I/O events through nio and distributes them to worker threads (the ThreadPool). That is why it is active only for very short periods.

Others

One interesting finding is that when I use wrk for sending requests, I can barely see Puma’s server thread being active. This is because wrk enables keep-alive by default, and Puma reuses the previous connection, so the Puma server doesn’t need to wait for a new request.

Without eBPF, we only know the system schedules threads, but we don't know how frequently this happens or how long a thread runs. This visibility helps us understand the system better.

Next, I plan to link scheduling events with lock events to understand how the GVL and other locks affect a Ruby HTTP server.

Symbolizing Ruby ISeq Through eBPF

yfractal — Thu, 03 Oct 2024 10:24:36 +0800

Introduction

A stack profiler scans the function stack, where we can find the function's address. To make this address meaningful, we need to retrieve the function name and other information—a process known as symbolization.

In this article, I will introduce how to symbolize Ruby instructions using eBPF and explain why I chose eBPF for this purpose. The code is here: https://github.com/yfractal/sdb/pull/7.

Background

We can think of the Ruby VM as a stack machine[1]. When it executes a function, it pushes the function address (ISeq) onto its stack, which is an array of rb_control_frame_struct. Simplified code is shown below:

A stack profiler can scan the rb_control_frame_struct array and retrieve the functions that are currently executing.
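As a toy model of this (not Ruby's real data structures), the Python sketch below keeps a list of control frames, scans it from the innermost frame outwards, and then symbolizes the addresses with a lookup table. All names and addresses are made up.

```python
from dataclasses import dataclass


@dataclass
class ControlFrame:
    iseq: int          # stand-in for the ISeq pointer stored in the frame


class ToyVM:
    def __init__(self):
        self.frames = []               # stand-in for the control frame array

    def call(self, iseq):
        self.frames.append(ControlFrame(iseq))

    def ret(self):
        self.frames.pop()


def scan_stack(vm):
    """Return the ISeq addresses of the running functions, innermost first."""
    return [frame.iseq for frame in reversed(vm.frames)]


def symbolize(addresses, symbols):
    """Map addresses to names; unknown addresses are shown as hex."""
    return [symbols.get(address, hex(address)) for address in addresses]


vm = ToyVM()
vm.call(0x1000)        # e.g. <main>
vm.call(0x2000)        # e.g. foo
raw = scan_stack(vm)
named = symbolize(raw, {0x1000: "<main>", 0x2000: "foo"})
```

Scanning only yields addresses. The symbols table is what symbolization has to build, and the rest of this article is about how to build it.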

Ruby natively supports this through rb_profile_frames, which fetches relevant information (iseq and line number). We can then retrieve additional details using functions like rb_profile_frame_method_name. Several tools make use of this approach, such as Shopify's stack_frames.

Why Another Stack Profiler?

Ruby already has several stack profiling tools, such as stackprof and Shopify's stack_frames. These tools use rb_profile_frames, which requires holding the Global VM Lock (GVL) and therefore blocks the execution of all other threads. Ruby does have Ractor, but the lock still blocks all threads within a Ractor, and Ractor doesn’t seem to be widely adopted. Even setting the GVL aside, these tools run in the application thread, which adds delay to the application.

https://github.com/yfractal/sdb solves these issues by pulling stack frames without holding the GVL (see this code). As it doesn’t affect application threads, it can be used on the fly, even in production environments.

Troubles After Releasing the GVL

The Ruby GVL ensures the integrity of VM data, including ISeqs, so we need to reacquire the GVL when reading an ISeq's fields. For performance reasons, we retrieve ISeq information in batches. However, the puller thread cannot hold a reference to an ISeq because it does not hold the GVL, so by the time we read an ISeq's information, the GC may already have freed it, which can cause a segmentation fault.

We could mitigate this by waiting for the Ruby VM to load all the code, checking the ISeq type, or catching segmentation faults.

That said, we can still improve the process. If asynchronous ISeq retrieval is error-prone, we can opt for synchronous retrieval. Since Ruby doesn't load code all the time, the performance impact should be minimal, so I believe this is a reasonable trade-off.

SDB eBPF Symbolizer

eBPF allows us to probe both kernel and user functions through kprobe and uprobe. It inserts a breakpoint instruction, and when this instruction is executed, it jumps to a predefined handler function[2][3].

As we can see, the handler executes synchronously, so we can insert probes at the point where the VM creates an ISeq and capture the relevant information there. Functions such as rb_iseq_new_with_opt and rb_iseq_new_with_callback serve this purpose well.

Using bcc makes this relatively simple:

b = BPF(text=bpf_text)
binary_path = "/home/ec2-user/.rvm/rubies/ruby-3.1.5/lib/libruby.so.3.1"
b.attach_uprobe(name=binary_path, sym="rb_iseq_new_with_opt", fn_name="rb_iseq_new_with_opt_instrument")
b.attach_uretprobe(name=binary_path, sym="rb_iseq_new_with_opt", fn_name="rb_iseq_new_with_opt_return_instrument")

In rb_iseq_new_with_opt_instrument, we can get arguments by PT_REGS_PARMX. For example, in Ruby 3.1.5, the second argument of rb_iseq_new_with_opt is the function’s name, which we can obtain as follows:

struct RString *name;
bpf_probe_read(&name, sizeof(name), (void *)&PT_REGS_PARM2(ctx));

Since RString is not a C string, we need to convert it to a C string. However, because eBPF operates in a sandboxed environment and cannot call user-space functions, we need to implement the conversion ourselves.

Here is a simple implementation:

static inline int read_rstring(struct RString *str, char *buff) {
    u64 flags;
    char *ptr;

    bpf_probe_read(&flags, sizeof(flags), &str->basic.flags);

    // Bit 13 is set when the string is heap-allocated rather than embedded
    if (flags & (1 << 13)) {
        bpf_probe_read(&ptr, sizeof(ptr), &str->as.heap.ptr);

        if (ptr) {
            // buff is a pointer, so sizeof(buff) would be 8, not the buffer size.
            // This assumes buff has room for MAX_STR_LENGTH bytes.
            bpf_probe_read_str(buff, MAX_STR_LENGTH, ptr);
        }

        return 1;
    } else {
        // Embedded string: the bytes are stored inside the RString itself
        bpf_probe_read_str(buff, MAX_STR_LENGTH, str->as.embed.ary);

        return 2;
    }
}

After obtaining the necessary information, we can submit it to the user program.

BPF_PERF_OUTPUT(events);

// rb_iseq_t *
// rb_iseq_new_with_opt(const rb_ast_body_t *ast, VALUE name, VALUE path, VALUE realpath,
//                      VALUE first_lineno, const rb_iseq_t *parent, int isolated_depth,
//                      enum iseq_type type, const rb_compile_option_t *option)
int rb_iseq_new_with_opt_instrument(struct pt_regs *ctx) {
    struct event_t event = {};

    struct RString *name;
    bpf_probe_read(&name, sizeof(name), (void *)&PT_REGS_PARM2(ctx));
    read_rstring(name, event.name);

    // event is a struct, not a pointer, so pass its address and its own size
    events.perf_submit(ctx, &event, sizeof(event));
    return 0;
}

Then, the data can be read in the user program as below:

def print_event(cpu, data, size):
    event = ctypes.cast(data, ctypes.POINTER(Event)).contents
    print(json.dumps(event.to_dict()))

b["events"].open_perf_buffer(print_event, 1024)

while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

The full code is here https://github.com/yfractal/sdb/pull/7.

Others

Probing ISeq creation alone is not enough, because an ISeq can be moved during GC compaction. To detect this, we could probe gc_move and record the scan (source) and free (destination) addresses. Since Ruby disables GC compaction by default, I leave this as future work.

static VALUE gc_move(rb_objspace_t *objspace, VALUE scan, VALUE free, size_t slot_size);
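On the user-space side, handling moves could look like the Python sketch below. It is a hypothetical design, since SDB does not implement this yet: the class and method names are invented, and the two callbacks stand for events that probes on ISeq creation and on gc_move would deliver.

```python
class IseqSymbolTable:
    """Maps ISeq addresses to names and follows objects moved by GC compaction."""

    def __init__(self):
        self.names = {}

    def on_iseq_created(self, address, name):
        # Event from a probe on ISeq creation.
        self.names[address] = name

    def on_gc_move(self, scan, free):
        # Event from a probe on gc_move: the object at `scan` moves to `free`.
        name = self.names.pop(scan, None)
        if name is not None:
            self.names[free] = name

    def lookup(self, address):
        return self.names.get(address)


table = IseqSymbolTable()
table.on_iseq_created(0x7F00_0000_1000, "foo")
table.on_gc_move(scan=0x7F00_0000_1000, free=0x7F00_0000_0040)
```

Moves of objects that are not in the table are ignored, so the probe can report every gc_move call without filtering.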

Besides eBPF, binary instrumentation or ptrace could offer better alternatives, as they can access the application’s functions. However, since https://github.com/yfractal/sdb is still experimental, I chose eBPF for its simplicity.

References

[1] Ruby Under a Microscope.
[2] https://docs.kernel.org/trace/kprobes.html#id2
[3] https://eli.thegreenplace.net/2011/01/27/how-debuggers-work-part-2-breakpoints

Viewing Ruby HTTPS Request Content Without Root Privileges or Certificates

yfractal — Sun, 15 Sep 2024 00:19:34 +0800

Introduction

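The code for this article is at https://github.com/yfractal/sdb/tree/main/sdb-shim

When developing or debugging, we often need to inspect request contents. For example, the well-known tcpdump can show HTTP traffic. There are also tools for HTTPS, such as the eBPF-based ecapture (the "e" in eBPF distinguishes it from the earlier BPF, although some people suggest simply calling it BPF; tcpdump itself uses BPF, and Van Jacobson is an author of both).

But these tools all need root privileges. In some environments, for example under a company's odd security requirements or inside Docker, the user has no root privileges, and viewing HTTPS request content becomes very difficult.

This article introduces a way to view HTTPS request content that requires neither changes to the Ruby code nor root privileges.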

How it works

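We can have the application itself tell us what its request content is. The most direct approach is to override Ruby's HTTP request method and print the decrypted content once the request returns.

But that requires adding a gem, and for a compiled language such as Go it would also require recompiling.

eBPF's uprobe inserts a trap instruction, such as int3, at the function's address. When execution reaches that address, it is diverted to the specified handler instead of going straight into the original function. But eBPF uprobes need root privileges.

We want both to avoid touching the code and to run without root privileges. To get both, on macOS we can use the __interpose section to replace the original function and then use DYLD_INSERT_LIBRARIES to link the library into the program. This achieves something similar to Ruby's alias method.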

How to implement

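First we need to find the right function. Ruby uses OpenSSL for encryption and decryption, but I am not familiar with the Ruby standard library or OpenSSL, so reading the code directly would be tedious.

Instead, we can issue a request from Ruby and use a stack profiling tool to see which methods were called, which narrows the search. I used https://github.com/yfractal/sdb. Even though the tool is not yet easy to use, it helped me narrow the location down to read_nonblock at openssl/buffering.rb:204.

After that, by debugging and reading the code, we can find that Ruby calls SSL_read to decrypt.

Once we have found the function, we need to tell macOS to perform the replacement.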

#include "openssl/ssl.h"

struct __osx_interpose {
    const void* new_func;
    const void* orig_func;
};

static int Real__SSL_read (void *ssl, void *buf, int num) { return SSL_read (ssl, buf, num); }
extern int __interpose_SSL_read (void *ssl, void *buf, int num);

static const struct __osx_interpose __osx_interpose_SSL_read __attribute__((used, section("__DATA, __interpose"))) =
  { (const void*)((uintptr_t)(&(__interpose_SSL_read))),
    (const void*)((uintptr_t)(&(SSL_read))) };

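In __interpose_SSL_read, we call SSL_read to get the decrypted content. Because the HTTP body is compressed, we first need to locate the body (the headers and the body are decrypted together) and decompress it; after that we have a readable body.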
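The "locate the body, then decompress" step can be sketched in Python as follows. This is a simplified illustration, not the C code in sdb-shim: it assumes the whole response arrives in one buffer, handles only gzip, and ignores chunked transfer encoding. The function name is made up.

```python
import gzip


def readable_body(plaintext: bytes) -> bytes:
    """Extract a readable body from a decrypted HTTP response buffer."""
    # Headers and body are separated by an empty line.
    head, separator, body = plaintext.partition(b"\r\n\r\n")
    if not separator:
        return b""                      # headers are not complete yet

    headers = {}
    for line in head.split(b"\r\n")[1:]:            # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip().lower()

    if headers.get(b"content-encoding") == b"gzip":
        return gzip.decompress(body)
    return body


response = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Encoding: gzip\r\n"
            b"\r\n" + gzip.compress(b'{"ok":true}'))
body = readable_body(response)
```

A real implementation also has to buffer data across multiple SSL_read calls, since one call may return only part of a response.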

The code is at https://github.com/yfractal/sdb/blob/main/sdb-shim/src/https_instrument.c .

Others

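For now, https_instrument is just a toy; I have only tested the simplest possible example. For me, writing this kind of tool is great fun. Besides, my company's development machines don't give me root privileges; it isn't a tech company, after all...

Compared with eBPF, besides not needing root privileges, this approach is also easier to develop with: it needs no extra support, and you can use any library freely.

For example, opentelemetry-go-instrumentation uses eBPF for instrumentation, but eBPF programs run in a separate memory space, which makes it extremely hard to work with complex Go data structures such as hash maps.

Linux does not support this approach directly, but you can use LD_PRELOAD to replace functions from dynamically linked libraries (corresponding code). A previous article of mine also covers this.

Although LD_PRELOAD can instrument OpenSSL, it cannot change the program's own code. In theory, modifying the binary, for example by inserting int3 at the relevant address and producing a new binary, should achieve a similar effect. Alternatively, this could be done at compile time, or by modifying the ELF file.

Compared with eBPF, I personally prefer function interposing for instrumentation. It requires cooperation from the application, but it is more controllable than eBPF. More importantly, development is simpler and more flexible.

https://github.com/yfractal/sdb is, for now, also a toy: not only is it hard to use, it can also cause segmentation faults.

https_instrument probably still has many problems, and I will improve it gradually as I use it. If anyone really needs it, I will find a way to make it easier to use.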

Detect Ruby GVL contention through dynamic link library functions

yfractal — Tue, 10 Sep 2024 22:05:55 +0800

Introduction

Ruby's Global VM Lock (GVL) protects the Ruby VM's data but reduces parallel execution because only one thread can hold the lock at a time.

The GVL can affect application performance. For example, in a Puma server with several threads, when one thread holds the lock, it causes delays for other threads.

Ruby 3.2 introduced a GVL instrumentation API, and there are several tools for visualizing it. However, such observability requires Ruby VM support. Observability that depends on VM support is slow to develop, struggles to cover all scenarios, and adds maintenance overhead for Ruby.

This article explores a more dynamic solution that provides similar observability without modifying Ruby code. It uses LD_PRELOAD[1] and dlsym[2] to wrap pthread lock functions, achieving behavior similar to Ruby's alias method.

The code is at https://github.com/yfractal/sdb

How it Works

Ruby’s GVL is implemented with a mutex and a condition variable, whose functions are resolved through the dynamic linker. On Linux, the dynamic linker lets us override those functions using LD_PRELOAD. In the overriding functions, we can log the relevant events and locate the original function through dlsym. This approach is similar to Ruby's alias method, but for dynamically linked functions.

#[no_mangle]
pub unsafe extern "C" fn pthread_mutex_lock(mutex: *mut pthread_mutex_t) -> i32 {
    // log acquire event ...
    if let Some(real_pthread_mutex_lock) = REAL_PTHREAD_MUTEX_LOCK {
        let ret = real_pthread_mutex_lock(mutex);
        // log acquired event ...
        ret
    } else {
        eprintln!("Failed to resolve pthread_mutex_lock");
        -1
    }
}

// then we could do similar things for pthread_mutex_unlock, pthread_cond_wait and pthread_cond_signal

To identify the mutex’s address, we need to access Ruby's rb_thread_t object.

Here’s a simplified version of the code:

pub unsafe extern "C" fn log_gvl_addr(_module: VALUE, thread_val: VALUE) -> VALUE {
    // find rb_thread_t from thread value
    let thread_ptr: *mut RTypedData = thread_val as *mut RTypedData;
    let rb_thread_ptr = (*thread_ptr).data as *mut rb_thread_t;

    // access gvl_addr through offset directly
    let gvl_addr = (*rb_thread_ptr).ractor as u64 + 344;
    let gvl_ref = gvl_addr as *mut rb_global_vm_lock_t;
    let lock_addr = &((*gvl_ref).lock) as *const _ as u64;

    // log gvl address ...
    rb_ll2inum(lock_addr as i64) as VALUE
}

Testing

I used the following script for testing:

# example.rb
require 'sdb'

Sdb.log_gvl_addr

threads = []
10.times {
  thread = Thread.new do
    Sdb.log_gvl_addr
    i = 0
    10000.times do
      i += 1
    end
  end
  threads << thread
}

threads.each {|thread| thread.join }

We can run it with this command: LD_PRELOAD=./target/release/libsdb_shim.so bundle exec ruby example.rb (libsdb_shim.so is the compiled Rust library).

Then, we could see logs similar to these:

2024-09-10 21:09:11.540956679 [INFO] [lock] thread_id=281472580841568, rb_thread_addr=187651089870448, gvl_mutex_addr=187651083330256

2024-09-10 21:09:11.53981372 [INFO] [lock][mutex][acquire]: thread=281472580841568, lock_addr=187651083330256
2024-09-10 21:09:11.539815804 [INFO] [lock][mutex][acquired]: thread=281472580841568, lock_addr=187651083330256
2024-09-10 21:09:11.539816595 [INFO] [lock][cond][acquire]: thread=281472580841568, lock_addr=187651083330256, cond_var_addr=187651089870568
2024-09-10 21:09:11.540927137 [INFO] [lock][cond][acquired]: thread=281472580841568, lock_addr=187651083330256, cond_var_addr=187651089870568

Others

Does the GVL Matter?

Ruby uses the GVL to protect its VM and releases the lock during I/O operations. It's not bad for I/O-bound applications.

However, background threads or code instrumentation (like NewRelic) can not only consume CPU resources but also introduce delays to all Ruby application threads.

eBPF Solution

We could use eBPF to probe these functions without modifying the application, but eBPF programs usually require root privileges and have more dependencies.

LD_PRELOAD alters the application’s library loading but is a much lighter solution compared to eBPF.

Improvements

The code demonstrates how to use LD_PRELOAD and dlsym to instrument the Ruby VM without modifying Ruby code.

Since Ruby’s GVL is complex (it uses condition variables and only acquires the lock when the GVL has an owner and the current thread is not the timer thread), instrumenting the mutex and condition variable doesn’t fully capture gvl_acquire and gvl_release. However, we can still infer GVL delays from the locking patterns.

The code logs events to a file, allowing for async analysis. We could use fast_log[4], which buffers logs in memory and writes them to a file in batches.

However, since the Ruby VM accesses the GVL very frequently, example.rb can generate over 80,000 lines of logs. As in LDB[3], performance could be further improved by logging lock events only when the delay exceeds a threshold.
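Threshold-based logging can be sketched in Python with a wrapper around threading.Lock. This only illustrates the idea and is not how sdb-shim or LDB is implemented; the class name and the 10 ms threshold are arbitrary choices for the example.

```python
import threading
import time


class TimedLock:
    """Wraps a lock and records an acquire only if waiting took too long."""

    def __init__(self, threshold_s=0.01):
        self._lock = threading.Lock()
        self.threshold_s = threshold_s
        self.slow_acquires = []          # wait times (seconds) above the threshold

    def acquire(self):
        start = time.perf_counter()      # "acquire" event
        self._lock.acquire()
        waited = time.perf_counter() - start   # "acquired" event
        if waited >= self.threshold_s:
            self.slow_acquires.append(waited)

    def release(self):
        self._lock.release()


lock = TimedLock(threshold_s=0.01)
lock.acquire()       # uncontended: returns immediately, nothing is recorded
lock.release()
```

Uncontended acquisitions, which are the vast majority, then cost two clock reads and produce no log line at all.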

Summary

This article uses LD_PRELOAD and dlsym to instrument the GVL without modifying Ruby code. You can find the code at https://github.com/yfractal/sdb

References

[1] https://man7.org/linux/man-pages/man8/ld.so.8.html
[2] https://linux.die.net/man/3/dlsym
[3] LDB: An Efficient Latency Profiling Tool for Multithreaded Applications
[4] https://github.com/rbatis/fast_log

[Translation] An Introduction to Garbage Collection and Ruby's RGenGC

yfractal — Sun, 07 Jul 2024 11:10:23 +0800

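Background

Recently, I have been working on an experimental project called Ccache (https://github.com/yfractal/ccache), which implements its core functionality in Rust and integrates with languages such as Ruby and Golang. Rust is a systems programming language that does not use garbage collection (GC) for memory management, whereas Ruby and Golang do. This raises an interesting question: how can Rust interact safely and efficiently with languages that use GC? So I spent some time learning how Ruby's GC works.

Introduction

In this article, I describe how RGenGC (Restricted Generational GC), introduced in Ruby 2.1, works. To make things easier to follow, I first explain the basic principles of garbage collection and then describe the unique problem RGenGC solves, along with its mechanism and source code. Along the way, I try to leave out unnecessary details.

Garbage Collection

A program allocates virtual memory from the operating system, which maps the virtual memory to physical memory (or other resources such as files). Because physical memory is limited and because of performance requirements, a program needs to return memory once it has finished using it.

Programs use two kinds of memory: stack and heap. Rust primarily uses stack memory, which requires knowing a variable's size before allocating it. For dynamically sized structures such as vectors, Rust allocates memory on the heap. To manage memory safely and efficiently, Rust does not allow shared mutable references or cyclic references. These restrictions allow Rust to manage heap memory with reference counting, but they make some data structures, such as doubly linked lists, difficult to implement [1].

C requires programmers to allocate and free heap memory manually, which is hard to use and error-prone. Rust manages memory through lifetimes and ownership, making memory management explicit, safe, and efficient.

Languages like Ruby and Go use GC: programmers do not need to think about when to free memory, because the GC handles it. The GC therefore needs to return memory that is no longer in use to the system efficiently.

The Mark and Sweep Algorithm

Mark and Sweep is a garbage collection algorithm that detects and frees memory that is no longer in use (dead).

To find dead memory, Mark and Sweep traverses all reachable objects by following references. Objects that are never visited are considered dead and can be freed. That is, the mark phase identifies all reachable objects and marks them; in the sweep phase, the marked objects are kept and all others are freed.

To traverse all reachable objects, Mark and Sweep uses breadth-first search (BFS). It repeatedly looks up all direct references of each object. To avoid infinite loops, each object is marked as visited once it has been reached, so it is never processed twice.

The algorithm maintains a queue of the reachable objects known so far. It dequeues an item, marks that item's unvisited references as visited, and enqueues them. A reference that has already been marked as visited is not added to the queue. This continues until the queue is empty, which means every reachable object has been visited.

Here is the pseudocode: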

queue = init_queue()

for object in global_objects:
  object.visited = true
  queue.enqueue(object)


# loop until the queue is empty
while queue.is_empty() == false:
  object = queue.dequeue()
  for reference in object.references:
    if reference.visited == false:
      reference.visited = true
      queue.enqueue(reference)

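Now that all reachable objects are known, the sweep phase iterates over all objects. If an object's visited field is false, the object is unreachable and can be freed.

In summary:

Mark phase: breadth-first search marks all live objects.
Sweep phase: scan memory and free the unmarked objects.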
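For readers who want to run it, the two phases can be written as a small Python program. This is a teaching sketch: the heap is just a Python list, "freeing" means dropping an object from that list, and the class and function names are invented for the example.

```python
from collections import deque


class Obj:
    def __init__(self, name):
        self.name = name
        self.references = []
        self.visited = False


def mark(roots):
    """Mark phase: breadth-first search from the roots."""
    queue = deque()
    for obj in roots:
        obj.visited = True
        queue.append(obj)
    while queue:
        obj = queue.popleft()
        for reference in obj.references:
            if not reference.visited:
                reference.visited = True
                queue.append(reference)


def sweep(heap):
    """Sweep phase: keep marked objects, drop the rest, and reset the marks."""
    survivors = [obj for obj in heap if obj.visited]
    for obj in survivors:
        obj.visited = False          # ready for the next collection
    return survivors


root, child, orphan = Obj("root"), Obj("child"), Obj("orphan")
root.references.append(child)
child.references.append(root)        # a cycle does not cause an infinite loop
orphan.references.append(child)      # referencing a live object does not keep orphan alive
heap = mark([root]) or sweep([root, child, orphan])
```

After the collection, heap contains only root and child.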

Mark and Compact

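Mark and Sweep can free unreachable objects, but it does not deal with fragmentation and leaves holes in memory.

Because the freed objects are not contiguous, it may be impossible to allocate memory for a large structure even when the total free memory is large enough.

Mark and Compact solves this problem: while marking objects, it also rearranges the reachable objects so that free memory becomes contiguous.

The idea is to split memory into two halves: FROM space and TO space. When FROM space is "full", the mark phase begins. During the mark phase, reachable objects are moved to TO space. Because other objects may reference a moved object, the object's old memory must record that it has been moved to a new address; this is a forward reference. The objects it references are then moved to TO space as well, and their addresses are updated. When the mark phase ends, all reachable objects have been moved to TO space, and FROM space can be freed.

TO space can also be viewed as the queue from the Mark and Sweep algorithm, with two pointers marking the start and end of the queue.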

Generational Garbage Collection

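To free FROM space, Mark and Compact must scan all objects, which is time-consuming. Generational Garbage Collection improves on this by scanning only new objects most of the time.

Generational GC is based on a simple observation: if an object has lived for a long time, it tends to live even longer [3], and new objects are more likely to be reclaimed. This means that most of the time we only need to examine new objects. For example, when a request arrives, Rails creates a controller instance, which should be reclaimed once the request ends. A database connection instance, however, may already have existed for a while and should not, and need not, be reclaimed.

Put simply, Generational GC divides objects into two categories: the young generation and the old generation. Young-generation objects were created recently, while old-generation objects have existed for a while. The goal is to scan the young-generation objects and free the associated memory.

For an object to be marked as unreachable (dead), we must be sure that no reference points to it. In the mark phase, our goal is to scan only the young-generation objects. If an old object references a new object, we record it specially and scan it during the mark phase. This guarantees that when an object is freed, no other object references it.

When we create a reference to an object, for example a.b = &c, no special handling is needed if a and c are in the same generation or if a is younger than c, because the reference will be scanned normally in the mark phase. When an old object references a new object, that is, when a is older than c, we need to record a and scan a in the mark phase.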

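Ruby's RGenGC

The Problem Ruby Needs to Solve

Before Ruby 2.1, Ruby only had a non-generational mark-and-sweep GC [4]. To support Generational GC, we need a write barrier to record when an old object references a new one. However, because of Ruby's large code base and the existence of third-party C extensions, Ruby could not require write barriers everywhere and still remain backward compatible.

To solve this compatibility problem, the Ruby team introduced the concept of Write-Barrier-Unprotected Objects.

Write-Barrier-Unprotected Objects

The write barrier is the record mentioned earlier, which must be made when an old object references a new object. Unprotected means that an old object may reference a new object without that reference having been recorded.

Since we cannot make every Ruby object use a write barrier when an old object references a new one, we cannot know whether these objects reference new objects. They must therefore be scanned during minor GC.

Ruby RGenGC marking steps [4]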

1. Put the root-set objects and objects referenced from the remembered set objects into the work queue.
2. Repeat the following until the work queue is empty:
a. Dequeue an object p from the work queue.
b. For each object c referenced from p do:
    i. If p is an old object:
       • If c is already marked, makes c an old object and add c to the remembered set.
       • If c is not marked and not an old object, makes c’s age two (becomes an old object at the next step).
   ii. Increment the age of c by one, mark c, and then put c to work queue if c was not marked and is not an old object. Note that, in our implementation, if the age of an object becomes 3, the object becomes an old object.

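To keep Write-Barrier-Unprotected Objects safe, Ruby detects these objects and puts them in a set so that they are scanned during minor GC. This lets Ruby reclaim objects safely.

Here is the relevant Ruby 2.2 code (with code unrelated to this article removed) [6]: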

static void
gc_mark_ptr(rb_objspace_t *objspace, VALUE obj)
{
    rgengc_check_relation(objspace, obj);

    if (!gc_mark_set(objspace, obj)) return; /* already marked */
    gc_aging(objspace, obj);
    gc_grey(objspace, obj);
}

static void
rgengc_check_relation(rb_objspace_t *objspace, VALUE obj)
{
    const VALUE old_parent = objspace->rgengc.parent_object;

    if (old_parent) { /* parent object is old */
        if (RVALUE_WB_UNPROTECTED(obj)) {
            gc_remember_unprotected(objspace, obj);
        } else {
            if (!RVALUE_OLD_P(obj)) {
                if (RVALUE_MARKED(obj)) {
                    /* An object pointed from an OLD object should be OLD. */
                    RVALUE_AGE_SET_OLD(objspace, obj);
                    if (is_incremental_marking(objspace)) {
                        if (!RVALUE_MARKING(obj)) {
                            gc_grey(objspace, obj);
                        }
                    } else {
                        rgengc_remember(objspace, obj);
                    }
                } else {
                    RVALUE_AGE_SET_CANDIDATE(objspace, obj);
                }
            }
        }
    }
}

Summary

This article introduced several basic GC algorithms to make Ruby's RGenGC easier to understand. It leaves out many interesting topics, such as parallel garbage collectors [8], and it only explains the ideas, so some details are intentionally omitted. For example, in Mark and Compact, after FROM space is freed, FROM space becomes TO space and TO space becomes FROM space, which is called a flip. Such details do not affect understanding of the core algorithms and are easy to grasp. For more details, see the links below [2][3][4][5][7][8].

Original post: https://ruby-china.org/topics/43798

References

[1] GhostCell: Separating Permissions from Data in Rust
[2] https://ruby-china.org/topics/32226
[3] A Real-Time Garbage Collector Based on the Lifetimes of Objects
[4] Gradual Write-Barrier Insertion into a Ruby Interpreter
[5] https://blog.peterzhu.ca/notes-on-ruby-gc/
[6] https://github.com/ruby/ruby/releases/tag/v2_2_1
[7] https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/resources/lecture-11-storage-allocation/
[8] https://inside.java/2022/08/01/sip062/

Garbage Collection 101 and Ruby's RGenGC (Restricted Generational GC)

yfractal — Sat, 06 Jul 2024 11:37:07 +0800

Background

Recently, I’ve been working on an experimental project called Ccache (https://github.com/yfractal/ccache), which implements core functions in Rust and integrates with other languages such as Ruby and Golang. Rust is a systems programming language that doesn’t use garbage collection (GC) for memory management, whereas Ruby and Golang do. This raises an interesting question: how can Rust interact safely and effectively with languages that use GC? To answer this question, I spent some time understanding how Ruby's GC works.

Introduction

In this article, I will describe how RGenGC (Restricted Generational GC), introduced in Ruby 2.1, works. To make things easier, I will first explain how garbage collection works in general, and then describe the unique problems that RGenGC solves, along with its mechanisms and source code.

Garbage Collection 101

Programs allocate virtual memory from the operating system, which maps the virtual memory to physical memory (or other resources such as files). Due to the limitations of physical memory and performance requirements, programs need to return the memory back once they are done using it.

Programs use two kinds of memory: stack and heap. Rust primarily uses stack memory, which requires knowing the size of variables before allocating memory. For dynamically sized structures, such as vectors, Rust allocates memory on the heap. To manage memory safely and efficiently, Rust doesn’t allow shared mutable and cyclic references. These restrictions allow Rust to use reference counting for managing heap memory but make it challenging to implement certain structures, like doubly linked lists[1].

C requires programmers to allocate and free heap memory manually, which is extremely difficult to manage correctly. Rust uses lifetimes and ownership to make memory management explicit and safe, minimizing performance costs as a system language.

Languages like Ruby and Go use GC—programmers do not need to consider when memory is freed, as it’s handled by the GC. Simply put, GC needs to return unused memory back to the system efficiently.

Mark and Sweep

Mark-and-sweep is a garbage collection algorithm that detects and frees inactive memory.

To find unused memory, the mark-and-sweep algorithm traverses all reachable objects through references. Unvisited objects are deemed inactive and can be freed. The mark phase identifies all reachable objects and marks them as such. In the sweep phase, these marked objects are retained while the others are freed.

To traverse all reachable objects, it uses the breadth-first search (BFS) algorithm. It repeatedly finds all direct references of each item. To avoid infinite loops, each item is marked as visited once it has been reached, ensuring it won't be visited again.

The algorithm maintains a queue to hold currently known reachable items. It dequeues an item, enqueues all its references, and marks it as visited. If a reference has already been visited, it isn’t added to the queue. This process continues until the queue is empty, indicating all reachable items have been visited.

Here's the pseudocode:

queue = init_queue()

for object in global_objects:
  object.visited = true
  queue.enqueue(object)


# loop until the queue is empty
while queue.is_empty() != true:
  object = queue.dequeue()
  for reference in object.references:
    if reference.visited == false:
      reference.visited = true
      queue.enqueue(reference)

Now that all reachable objects are known, the sweep phase involves looping through all objects. If an object’s visited field is false, it means the object is unreachable and can be freed.

To summarize, the process is:

Mark stage: breadth-first search marks all of the live objects.
Sweep stage: scan over memory to free unmarked objects.

Mark and Compact

While mark-and-sweep can free unreachable objects, it doesn’t deal with fragmentation, which creates holes in memory.

As the freed objects are discontinuous, memory cannot be allocated for large structures even if there is enough total memory available.

Mark and Compact solves this issue by not only freeing unused memory but also rearranging live objects to make free memory contiguous.

The idea is to divide memory into two halves, FROM space and TO space. When FROM space is "full," the mark phase starts. During the mark phase, objects are moved from FROM space to TO space. Since other objects may refer to the moved object, a forward reference is recorded in the object’s old memory. The object’s references are also moved to TO space, and their pointers are updated to the new address. After the marking phase, all reachable references have been moved to TO space, and the old space can be freed.

You can also consider TO space as the queue in the mark-and-sweep algorithm, with two pointers representing the start and end of the queue.
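The "TO space is the queue" observation can be made concrete with a short Python sketch of a copying collector. It is a simplified model under stated assumptions: objects are Python instances rather than raw memory, TO space is a list, and the forwarding reference is an attribute on the old object. The names are invented for the example.

```python
class Obj:
    def __init__(self, name, refs=None):
        self.name = name
        self.refs = list(refs or [])
        self.forward = None          # forwarding reference, set once the object is copied


def collect(roots):
    """Copy everything reachable from `roots` into a fresh TO space."""
    to_space = []                    # also serves as the BFS queue

    def copy(obj):
        if obj.forward is None:
            clone = Obj(obj.name, obj.refs)   # refs still point into FROM space
            obj.forward = clone
            to_space.append(clone)            # advances the "end of queue" pointer
        return obj.forward

    new_roots = [copy(root) for root in roots]

    scan = 0                         # "start of queue" pointer
    while scan < len(to_space):
        obj = to_space[scan]
        # Copy each referenced object (or reuse its forwarding reference)
        # and update the pointer to the new address.
        obj.refs = [copy(child) for child in obj.refs]
        scan += 1
    return new_roots, to_space


a, b, c = Obj("a"), Obj("b"), Obj("c")
a.refs.append(b)
b.refs.append(a)                     # cycle between live objects
c.refs.append(a)                     # c is unreachable from the roots
new_roots, to_space = collect([a])
```

Everything between the scan pointer and the end of to_space is copied but not yet scanned, which is exactly the contents of the BFS queue. No separate queue is needed.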

Generational Garbage Collection

To release FROM space, the mark-and-compact algorithm has to scan through all objects, which is time-consuming. Generational garbage collection improves this by scanning only new objects most of the time.

Generational garbage collection is based on a simple observation: if an object lives for a long time, it tends to live even longer[3]. This means we only need to check new objects most of the time. For example, when a request comes in, Rails creates a controller instance, which should be reclaimed after the request finishes. However, DB connection instances can live for a while and should not be reclaimed.

To simplify, objects are divided into two categories: new generation and old generation. New-generation objects have been created recently, while old-generation objects have been around for a while. The goal is to free memory related to new-generation objects after scanning them.

For an object to be marked as dead, there must be no references to it. During the marking phase, we scan all new-generation objects. If an old object references a new object, we need to record it and scan it during the marking phase.

When an object a stores a reference to an object c, we need to compare their generations. If a is in the same generation as c or is younger than c, nothing special is needed, because the reference is found during normal marking. If a is older than c, a is recorded separately and scanned during the marking phase as well. Because all of these references are considered, we can safely treat an unmarked object as dead.

Ruby RGenGC

The Write Barrier

During generational garbage collection, we scan young generation objects most of the time to reduce GC cost. To do this safely, we need to record when an old object references a young object. The hook that makes this record whenever a reference is written is called a Write Barrier.

The Unique Problem Ruby Faces

Before Ruby 2.1, Ruby only had a non-generational mark-and-sweep GC[4]. To support generational GC, we need a write barrier to record when an old object references a new object. However, this is challenging because Ruby has a large code base and many third-party C extensions.

To solve this compatibility issue, the Ruby team created the concept of Write-Barrier-Unprotected Objects.

Write-Barrier-Unprotected Objects

Since we can't make every Ruby object go through a Write Barrier, we can't know whether an unprotected old object has come to reference new objects. So during minor GC, we must scan these objects.

Ruby RGenGC Marking Steps without WB-unprotected objects[4]:

1. Put the root-set objects and the objects referenced from remembered-set objects into the work queue.
2. Repeat the following until the work queue is empty:
    a. Dequeue an object p from the work queue.
    b. For each object c referenced from p:
        i. If p is an old object:
            • If c is already marked, make c an old object and add c to the remembered set.
            • If c is not marked and not an old object, set c’s age to two (it becomes an old object at the next step).
        ii. Increment the age of c by one, mark c, and then put c in the work queue, if c was not marked and is not an old object. Note that, in our implementation, if the age of an object becomes 3, the object becomes an old object.
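
The steps above can be traced with a small Ruby model. It is one reading of the listed steps, not Ruby's implementation: RObj and rgengc_minor_mark are hypothetical names, treating the initially queued objects as already marked is an assumption, and the condition in step ii is taken to guard all three actions.

```ruby
OLD_AGE = 3   # per the note in step ii: an object of age 3 is old

class RObj
  attr_reader :name, :refs
  attr_accessor :age, :marked

  def initialize(name, age: 0, refs: [])
    @name = name
    @age = age
    @refs = refs
    @marked = false
  end

  def old?
    age >= OLD_AGE
  end
end

def rgengc_minor_mark(roots, remembered_set)
  # Step 1. Assumption: objects seeded into the queue count as marked.
  queue = roots + remembered_set.flat_map(&:refs)
  queue.each { |obj| obj.marked = true }

  until queue.empty?                        # step 2
    parent = queue.shift                    # step a
    parent.refs.each do |c|                 # step b
      was_marked = c.marked
      was_old = c.old?

      if parent.old? && !was_old            # step i
        if was_marked
          c.age = OLD_AGE
          remembered_set << c unless remembered_set.include?(c)
        else
          c.age = OLD_AGE - 1               # becomes old in step ii
        end
      end

      next if was_marked || was_old         # step ii
      c.age += 1
      c.marked = true
      queue << c
    end
  end
end

entry  = RObj.new("entry")
cache  = RObj.new("cache", age: OLD_AGE, refs: [entry])   # old parent
params = RObj.new("params")
req    = RObj.new("request", refs: [params])              # young parent

rgengc_minor_mark([cache, req], [])
puts "entry:  age #{entry.age}, old? #{entry.old?}"
puts "params: age #{params.age}, old? #{params.old?}"
```

In this model the child of the old object is promoted in a single pass, while the child of the young object only ages by one.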

To keep Write-Barrier-Unprotected Objects safe, Ruby detects them during marking and records them in a set so that every minor GC scans them.

Here is the relevant Ruby 2.2 code (simplified by removing unrelated code)[6]:

static void
gc_mark_ptr(rb_objspace_t *objspace, VALUE obj)
{
    rgengc_check_relation(objspace, obj);

    if (!gc_mark_set(objspace, obj)) return; /* already marked */
    gc_aging(objspace, obj);
    gc_grey(objspace, obj);
}

static void
rgengc_check_relation(rb_objspace_t *objspace, VALUE obj)
{
    const VALUE old_parent = objspace->rgengc.parent_object;

    if (old_parent) { /* parent object is old */
        if (RVALUE_WB_UNPROTECTED(obj)) {
            gc_remember_unprotected(objspace, obj);
        } else {
            if (!RVALUE_OLD_P(obj)) {
                if (RVALUE_MARKED(obj)) {
                    /* An object pointed from an OLD object should be OLD. */
                    RVALUE_AGE_SET_OLD(objspace, obj);
                    if (is_incremental_marking(objspace)) {
                        if (!RVALUE_MARKING(obj)) {
                            gc_grey(objspace, obj);
                        }
                    } else {
                        rgengc_remember(objspace, obj);
                    }
                } else {
                    RVALUE_AGE_SET_CANDIDATE(objspace, obj);
                }
            }
        }
    }
}

Summary

This article introduced several basic GC algorithms to help understand Ruby's RGenGC more easily. It doesn’t cover many interesting topics such as the Parallel Garbage Collector[8] because the focus is on the basics and those related to Ruby. Some details are intentionally omitted, such as the flip in Mark and Compact, as they are obvious and easy to understand. For more details, you can refer to these links: [2][3][4][5][7][8].

References

[1] GhostCell: Separating Permissions from Data in Rust
[2] https://ruby-china.org/topics/32226
[3] A Real-Time Garbage Collector Based on the Lifetimes of Objects
[4] Gradual Write-Barrier Insertion into a Ruby Interpreter
[5] https://blog.peterzhu.ca/notes-on-ruby-gc/
[6] https://github.com/ruby/ruby/releases/tag/v2_2_1
[7] https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/resources/lecture-11-storage-allocation/
[8] https://inside.java/2022/08/01/sip062/

Calling Rust from Ruby: How Rutie Works

yfractal — Thu, 27 Jun 2024 21:46:43 +0800

Introduction

Recently, I've been developing an experimental project called ccache (https://github.com/yfractal/ccache), a Redis client-side cache that guarantees consistency. Since it operates on the client side, I need to ensure it supports different programming languages.

The common practice is to write similar logic in each language, as seen with Redis clients and the OpenTelemetry instrumentation libraries. However, this approach involves tedious and repetitive work.

One potential solution is to write the core functionality in Rust and integrate it with different languages. To achieve this, we need to address the discrepancies between Rust and other languages, such as how to represent data and manage memory safely.

In this article, I will introduce Rutie, which bridges the gap between Ruby and Rust.

How it works

Ruby MRI is written in C, so it natively works well with C. The idea is to expose Rust code through the C ABI while keeping memory safe.

Ruby Calls Rust Functions

First, we can let cargo compile Rust files into a dynamic library by specifying crate-type = ["dylib"] in Cargo.toml.

Then we can call the function through fiddle[2].

For example:

#[allow(non_snake_case)]
#[no_mangle]
pub extern "C" fn rust_method() {
  println!("hello from Rust");
}

require 'fiddle'

handle = Fiddle.dlopen("./target/release/libruby_example.dylib")
Fiddle::Function.new(handle['rust_method'], [], Fiddle::TYPE_VOID).call
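
To try the same mechanism without building a Rust library first, here is a small sketch that calls strlen from the C library already loaded into the Ruby process. It assumes a Unix-like system where Fiddle.dlopen(nil) exposes the process's own symbols; choosing strlen is purely for illustration.

```ruby
require 'fiddle'

# Passing nil opens the running process itself, so functions from the C
# library that Ruby is linked against (such as strlen) can be looked up.
handle = Fiddle.dlopen(nil)

strlen = Fiddle::Function.new(
  handle['strlen'],            # address of the C function
  [Fiddle::TYPE_VOIDP],        # argument types: one pointer (const char *)
  Fiddle::TYPE_SIZE_T          # return type: size_t
)

# Fiddle converts the Ruby String into a char pointer for the call.
puts strlen.call("hello from Ruby")
```

Calling a function exported from Rust works the same way: look the symbol up in the handle, then describe its argument and return types.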

Bind C Functions to Ruby Class

Ruby allows us to define Ruby methods through C, which is more convenient than using fiddle. For example, void rb_define_method(VALUE klass, const char *name, VALUE (*func)(ANYARGS), int argc) defines an instance method for a class. The first argument is the class, the second is the method’s name, the third is the callback function, and the fourth is the number of arguments the method takes. Passing -1 means the callback receives the arguments as a C array.

After the method is defined through rb_define_method(SomeClass, "a_method", call_back_ptr, -1), calling it in the Ruby VM, for example SomeClass.new.a_method, makes Ruby invoke the callback function. The callback receives the argument count, the argument array, and the receiver (self in Ruby). For example, the callback function in C could be:

static VALUE
ruby_insert(int argc, VALUE *argv, VALUE self) {
  // ...
}

Bind C Struct to Ruby Object

C and Rust both organize data in structs and define functions or methods that operate on them. Binding a struct to a Ruby object lets us reuse existing code. Ruby achieves this through the rb_data_typed_object_wrap and rb_check_typeddata functions.

rb_data_typed_object_wrap creates a new instance with the struct. Its signature is VALUE rb_data_typed_object_wrap(VALUE klass, void *datap, const rb_data_type_t *type). datap is the pointer to our struct, and the return value is the created instance.

Then we can use rb_check_typeddata to find the struct. Its signature is void * rb_check_typeddata(VALUE obj, const rb_data_type_t *data_type). The first argument is the Ruby instance, and it returns the struct’s pointer.

Making Ruby Work with Rust

Binding Rust Methods to Ruby Classes

In the sections above, I explained how Ruby works with C. Now, I will introduce how Rust works with Ruby.

Rust allows us to define C functions using the extern keyword:

pub extern fn ruby_insert(
    argc: ::rutie::types::Argc,
    argv: *const ::rutie::AnyObject,
    mut rtself: SomeClass,
) -> AnyObject {
  // ......
}

Then the method can be bound to a class through:

Class::new("SomeRubyClass", None).define(|klass| {
    klass.def("rs_insert", ruby_insert);
});

Binding Rust Structs to Ruby Objects

To bind a struct to a Ruby object, we need to manage memory properly, since Ruby uses garbage collection (GC) while Rust relies on ownership. Rutie solves this problem by delegating the struct’s memory management to Ruby. When it wraps data, it uses Box::into_raw(Box::new(data)) as *mut c_void (in the Class::wrap_data method). Box::new allocates the struct on the heap, and Box::into_raw hands out the raw pointer without running the destructor, so Rust doesn’t free the value when the variable goes out of scope.

When Ruby reclaims the object that wraps the struct, it also frees the struct. The last argument of rb_data_typed_object_wrap carries the free callback, which is:

pub extern "C" fn free<T: Sized>(data: *mut c_void) {
    // Memory is freed when the box goes out of the scope
    unsafe {
        let _ = Box::from_raw(data as *mut T);
    };
}

This callback is defined in Rutie itself.

Moreover, Rutie provides several macros and methods to make this process easier:

wrappable_struct!(SomeStruct, SomeStructWrapper, SOME_STRUCT_WRAPPER);
class!(RubySomeStruct);
class::wrap_data(Class::from_existing("RubySomeStruct").value(), some_struct, &*SOME_STRUCT_WRAPPER);

wrappable_struct! defines a wrapper struct:

pub struct SomeStructWrapper<T> {
    data_type: ::rutie::types::DataType,
    _marker: ::std::marker::PhantomData<T>,
}

PhantomData is used for referencing T, and data_type is for the last argument of rb_data_typed_object_wrap. The macro defines a global variable SOME_STRUCT_WRAPPER for use, similar to a singleton in Ruby[3].

class!(RubySomeStruct); defines a Ruby class for wrapping.

class::wrap_data(Class::from_existing("RubySomeStruct").value(), some_struct, &*SOME_STRUCT_WRAPPER); is called in initialize to do the binding. As mentioned, SOME_STRUCT_WRAPPER holds a SomeStructWrapper, whose data_type carries the free callback:

pub extern fn ruby_initialize(
    argc: ::rutie::types::Argc,
    argv: *const ::rutie::AnyObject,
    mut rtself: RubySomeStruct,
) -> AnyObject {
    // some_struct is the Rust value to wrap; its construction is omitted here
    return class::wrap_data(Class::from_existing("RubySomeStruct").value(), some_struct, &*SOME_STRUCT_WRAPPER);
}

To use the struct:

pub extern fn ruby_insert(
    argc: ::rutie::types::Argc,
    argv: *const ::rutie::AnyObject,
    mut rtself: RubySomeStruct,
) -> AnyObject {
    // Fetch the Rust struct bound to this Ruby object
    let rs_struct = rtself.get_data_mut(&*SOME_STRUCT_WRAPPER);
    // ......
}

We need the global variable SOME_STRUCT_WRAPPER because get_data_mut calls rb_check_typeddata, whose second argument is the rb_data_type_t provided by the struct wrapper’s data_type field.

Memory Safety

A wrapped Rust struct is safe because the responsibility for freeing memory has been delegated to Ruby, allowing Ruby to free the memory during garbage collection (GC).

Variables passed to Rust are structures of pointers, so Rust does not free the contents the pointers point to. Since this is safe in C, it is safe in Rust as well.

The return variables are allocated through Ruby's memory system, and when they return, their ownership is passed to the caller—Ruby—which is also safe.

However, it becomes unsafe when Rust wants to keep a reference in its struct, because Ruby doesn't know that Rust holds the reference and might free the object, causing a use-after-free. A simple solution: when Rust holds a reference in an Arc, it asks Ruby to keep an additional reference to the object, and when Rust drops the Arc, it asks Ruby to remove that additional reference.

Summary

This article introduced how Rutie enables Ruby to use Rust code and discussed memory safety. Many details are not covered here, such as how Rutie lets Rust call Ruby, how to build for different platforms, and how it translates structs and binds C methods. I encourage you to use Rutie and explore its source code for a deeper understanding.

I believe Rutie can greatly benefit the Ruby community. Not only can Rust enhance performance, but it also allows Ruby to leverage Rust implementations, such as gRPC and OpenTelemetry metrics.

A complete example of using Rutie can be found at https://github.com/yfractal/ccache/tree/main/ccache_rb.

References

Rust2go: calls Go from Rust

yfractal — Mon, 24 Jun 2024 21:23:37 +0800

Introduction

Recently, I've been developing an experimental project called ccache, a Redis client-side cache that guarantees consistency. Since it operates on the client side, I need to ensure it supports different programming languages.

The common practice is to write similar logic in each language, as seen with Redis clients and the OpenTelemetry instrumentation libraries. However, this approach involves tedious and repetitive work.

Rust2go is a practical FFI framework that enables calling Go from Rust. In this article, I will introduce how it works.

Benefits

Due to its low overhead and safety guarantees, Rust has been integrated into many other systems traditionally written in C, such as Ruby and Linux.

Integrating Rust with other high-level languages is beneficial as it can improve performance and reduce repetitive work.

For example, ByteDance reduced CPU usage by more than 30% after migrating a core service from Golang to Rust[2].

Additionally, OpenTelemetry supports 11 languages[3]. Using Rust for core functionality can significantly reduce development efforts and prevent inconsistencies between different language implementations.

How Rust2go Works

Calling Go Functions

After building the Go code into a library and linking it to Rust, the Go functions become accessible within the Rust project.

However, Rust and Go have different calling conventions, so Rust cannot directly call Go functions. One solution is to use a trampoline to handle this issue[4]. Due to the unstable Rust ABI and the desire to address goroutine stack expansion, the author of rust2go chose not to use this method.

rust2go uses the C ABI as a "bridge" between Rust and Go. The Go functions are exposed as C functions through cgo, and Rust calls these C functions.

Memory Representation

Rust and Go represent structs in different ways. In Rust2go, a struct is first converted to a C struct and then to a Go struct. For example, a Rust struct DemoUser is converted to DemoUserRef and then to a Go DemoUser.

pub struct DemoUser {
    pub name: String,
    pub age: u8,
}

typedef struct DemoUserRef {
  struct StringRef name;
  uint8_t age;
} DemoUserRef;

func newDemoUser(p C.DemoUserRef) DemoUser {
    return DemoUser{
        name: newString(p.name),
        age:  newC_uint8_t(p.age),
    }
}

Then it converts the primitive types. For example, StringRef is converted to a Go string by newString:

func newString(s_ref C.StringRef) string {
    return unsafeString((*byte)(unsafe.Pointer(s_ref.ptr)), int(s_ref.len))
}

func unsafeString(ptr *byte, length int) string {
    sliceHeader := &reflect.SliceHeader{
        Data: uintptr(unsafe.Pointer(ptr)),
        Len:  length,
        Cap:  length,
    }
    return *(*string)(unsafe.Pointer(sliceHeader))
}

I will explain why Rust2go uses XXXRef in the next section.

Passing Variables Between Rust and Go

In the previous section, I explained how rust2go understands structs in Rust and Go. Now, I will explain how it passes variables between the two languages.

Passing Arguments to Go

The simplest and most straightforward method is to use a serialization protocol such as Thrift or Protocol Buffers. Rust2go does not choose this method because it wastes CPU time converting the data back and forth.

Instead, it passes arguments through pointers and converts the data so that both Rust and Go can understand it. This avoids deep-copying data such as strings and binary blobs.

This method adheres to Rust's safety rules because the arguments are "borrowed" by Go, and the memory is "owned" by Rust. Once Go finishes using the data, it frees its allocated memory, but the variables' memory allocated by Rust is not freed by Go.

Receiving Return Variables from Go

The return values are created by Go, so Go's garbage collector may free them at any time, even while Rust is still using them. A Rust-to-Rust call does not have this problem because the caller's variable takes ownership of the result, as in let x = some_func();.

Rust2go handles this by copying the variable in the C callback so that Rust and Go can manage the "same" variable independently.

Summary

This article provides an introduction to how rust2go works. For more details, please refer to the author's article[2].

References

Memory Safety between Rust and Ruby — Making Ruby Allocated Memory Works in Rust

yfractal — Sat, 08 Jun 2024 17:55:42 +0800

Introduction

Breaking the barrier between different programming languages is both interesting and beneficial. For example, ByteDance developed rust2go[1] to facilitate migrating Golang projects to Rust smoothly[2].

Importing Rust into Ruby can not only improve performance but also reduce tedious, repetitive work. For instance, Rust has implemented OpenTelemetry metrics, whereas Ruby hasn't. Wrapping the Rust implementation for use in Ruby can save a lot of effort. Ccache[3] is an experimental project exploring this direction.

Ccache is an etag-based local cache that saves cache values in Arc, which are then queried by Ruby. It needs to consider how to handle memory efficiently between the two systems.

This article introduces an idea about how to make Rust Arc work between Rust and Ruby without copying.

Background

Programs allocate memory and, after usage, need to consider how to return that memory.

The most straightforward way is managing memory manually, where programmers need to know when a variable can be freed and release it to the system. This is how C works; however, it's error-prone, and many memory bugs can be found in C programs.

Rust improves this by providing ownership. Variables belong to a specific scope, and when execution goes out of that scope, Rust releases the memory. This makes memory management explicit in the code.

Arc (an atomically reference-counted pointer) lets several owners share one value. Each clone of the Arc increases the reference count, and each drop decreases it. Rust releases the memory when the count reaches zero.
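
The counting itself can be modeled in a few lines of Ruby. This is only an illustration of the idea, not how Arc is implemented: SharedValue is a hypothetical class, the counter is not atomic, and the release callback stands in for freeing memory.

```ruby
class SharedValue
  attr_reader :count

  def initialize(value, &release)
    @value = value
    @release = release   # called once, when the last reference is dropped
    @count = 1
  end

  # A new user of the value: increase the count.
  def acquire
    @count += 1
    self
  end

  # A user is done: decrease the count and release at zero.
  def drop
    @count -= 1
    @release.call(@value) if @count.zero?
  end
end

freed = []
shared = SharedValue.new("cached row") { |v| freed << v }
reader = shared.acquire      # a second user: count is now 2
shared.drop                  # first user done: count is 1, nothing is released
reader.drop                  # last user done: count is 0, release runs
p freed
```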

Ruby's Garbage Collection (GC) allows programs to use memory without considering when to release it. The Ruby VM triggers GC when necessary, and during GC, it finds unreachable variables and returns their memory to the system or memory pool.

The problem arises when we use Rust's Arc and need to pass the value to the Ruby part.

Reproducing the Problem

An Example Using Rust `Arc`

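#[macro_use]
extern crate rutie;
#[macro_use]
extern crate lazy_static;

use rutie::{AnyObject, Class, NilClass, Object, RString};
use std::collections::HashMap;
use std::sync::Arc;

pub struct Store {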
    hash_map: HashMap<String, Arc<AnyObject>>,
}

impl Store {
    fn new() -> Self {
        Store {
            hash_map: HashMap::new(),
        }
    }
}

wrappable_struct!(Store, StoreWrapper, STORE_WRAPPER);
class!(RubyStore);

methods!(
    RubyStore,
    rtself,
    fn ruby_new() -> AnyObject {
        let store = Store::new();
        Class::from_existing("RubyStore").wrap_data(store, &*STORE_WRAPPER)
    },
    fn ruby_insert(key: RString, obj: AnyObject) -> AnyObject {
        let rbself = rtself.get_data_mut(&*STORE_WRAPPER);

        rbself
            .hash_map
            .insert(key.unwrap().to_string(), Arc::new(obj.unwrap()));
        NilClass::new().into()
    },
    fn ruby_get(rb_key: RString) -> AnyObject {
        let rbself = rtself.get_data_mut(&*STORE_WRAPPER);

        let key = rb_key.unwrap().to_string();
        let val = rbself.hash_map.get(&key).unwrap();
        AnyObject::from(val.value())
    },
);

#[allow(non_snake_case)]
#[no_mangle]
pub extern "C" fn Init_ruby_example() {
    Class::new("RubyStore", None).define(|klass| {
        klass.def_self("new", ruby_new);
        klass.def("insert", ruby_insert);
        klass.def("get", ruby_get);
    });
}

The Store struct wraps a simple HashMap whose values are Arcs of Ruby AnyObject. The Arc is there so that values can be shared in concurrent usage.

And it seems to work well:

it 'works' do
  store = RubyStore.new
  foo = Foo.new(1, 2)
  store.insert("key", foo)

  sleep 0.1
  GC.start

  expect(store.get("key").class).to eq Foo
  expect(store.get("key").a).to eq 1
  expect(store.get("key").b).to eq 2
end

The Segmentation Fault

The above example works because the created Foo object is still referenced by the foo variable, so its memory has not been freed.

To trigger the segmentation fault or other memory issues, we create and pass the Foo object directly to RubyStore.

it 'has memory issues :(' do
  store = RubyStore.new
  store.insert("key", Foo.new(1, 2))

  sleep 0.1
  GC.start

  expect(store.get("key").class).to eq Foo
end

Then it raises a segmentation fault.

A Simple Solution

We can avoid this by deep cloning the Arc’s value; however, it is not zero cost. Serializing a large object may take more than 1ms in Ruby[3]. To improve performance, we need to consider other methods.

To work around this situation, one option is to let Rust allocate the memory and free it through drop. However, this means Rust needs to figure out whether the allocated memory is being used by Ruby, which is not feasible. Thus, memory must be allocated by Ruby.

Rust's ownership model is a useful guide here: each piece of memory has an owner that is responsible for releasing it. So we need to divide the responsibilities between Ruby and Rust. The memory is allocated by Ruby, so Ruby has the duty to release it. Rust uses the memory but doesn’t own it. Thus, we can still use Arc; the only difference is that dropping the Arc does not free the memory.

pub struct RubyObject {
    value: rutie::types::Value,
}

impl Drop for RubyObject {
    // drop nothing, GC was handled by Ruby
    fn drop(&mut self) {
    }
}

Rust has done its job properly, so we need to consider Ruby's job now. In Ruby, it allocates memory from the system, so it needs to free the memory. Additionally, it passes the object to Rust, so it needs to record this.

klass.def("insert_inner", ruby_insert);

class RubyStore
  def insert(key, val)
    @_val = val
    insert_inner(key, val)
  end
end

The value is assigned to the instance variable @_val, making it reachable through the RubyStore instance. This prevents Ruby's GC from reclaiming its memory. Note that @_val holds only the most recently inserted value. When the key is deleted, we can set @_val = nil to “free” its memory.
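
A single instance variable only protects the last inserted value. As a sketch of how each key could keep its own reference, the variant below retains values in a Hash. The Hash, the delete and retained? methods, and the stubbed insert_inner are assumptions made so the example runs without the Rust extension; they are not part of the demo project.

```ruby
class RubyStore
  # Stand-in for the Rust-backed insert_inner. The real method would hand
  # `val` over to the Rust store.
  def insert_inner(key, val)
    (@rust_side ||= {})[key] = val.object_id
    nil
  end

  def insert(key, val)
    (@_retained ||= {})[key] = val    # one Ruby-visible reference per key
    insert_inner(key, val)
  end

  def delete(key)
    (@_retained ||= {}).delete(key)   # dropping the reference lets GC reclaim the value
  end

  def retained?(key)
    (@_retained ||= {}).key?(key)
  end
end

store = RubyStore.new
store.insert("a", Object.new)
store.insert("b", Object.new)
GC.start
puts store.retained?("a") && store.retained?("b")
```

Both values stay reachable through the store until their keys are deleted, so a GC run in between cannot free them.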

Now, everything works well.

Discussion

Clearly, the current solution is far from ideal. To improve it, we can keep all the objects referenced by Arcs in a doubly linked list on the Ruby side. When Drop is called, instead of doing nothing, we can remove the reference from the doubly linked list.

With this solution, Rust depends on Ruby's GC for reclaiming memory, and it requires Ruby’s cooperation. By Rust's standards this isn’t strictly safe, because Rust doesn’t trust programmers to uphold such a contract. Ruby works differently; it assumes programmers can do the right things (though they often don’t). Thus, this solution is acceptable for Ruby. To make it safer and cleaner, we can handle the reference bookkeeping in Rust code.

Another interesting direction is to make ownership work in Ruby, not only for Arc, but also for mutable/immutable access and lock usage. This could make the interaction smoother and make Ruby safer.

Summary

This article discussed a method to integrate Rust's Arc into Ruby, ensuring memory safety without the need for deep cloning. You can find the whole example in rust_arc_demo[4] and a use case in Ccache.

References

eBPF USDT in Rust

yfractal — Wed, 22 May 2024 22:42:51 +0800

I wrote an article on how to use USDT in Rust.

Here is an example that uses USDT: https://github.com/yfractal/ccache/pull/7/files

Is Redis Really Fast?

yfractal — Sun, 30 Oct 2022 12:48:50 +0800

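It mainly explains why Redis is fast, and why Redis is still not fast enough.

I can't upload screenshots, so I'll post a link for now and move the content here once image upload works.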

https://www.zhihu.com/question/563209865/answer/2736513961

I've been researching in-memory caches recently, which is how this Zhihu answer came about.

Does anyone want to chat about pitchfork, Shopify's new app server?

yfractal — Sun, 09 Oct 2022 10:18:50 +0800

JeMalloc Resources

yfractal — Sat, 08 Oct 2022 15:09:44 +0800

Background

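I've recently been curious about how JeMalloc is implemented, so I read some related material and the source code. Below is a collection of those resources.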

Resources

General

Pseudomonarchia jemallocum
- link: http://phrack.org/issues/68/10.html
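- Although it is about JeMalloc security issues, it also analyzes JeMalloc's implementation very well. It is a high-quality technical article; its only drawback is that the version it describes is fairly old.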
A Scalable Concurrent malloc(3) Implementation for FreeBSD
- link: https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
- The paper by JeMalloc's author, Jason Evans.
Scalable memory allocation using jemalloc
- link: https://www.facebook.com/notes/10158791475077200/
- A Facebook engineering blog post; it covers JeMalloc 2.1.0.
How JeMalloc Works
- link: https://github.com/yfractal/blog/blob/master/blog/2022-10-05-jemalloc.md
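- My own summary. It focuses on how JeMalloc allocates and frees memory and on the relationships between its structs, and gives a general picture of how JeMalloc works. It covers version 5.2.1.

Source Code Analysis

JeMalloc 源码分析 (JeMalloc Source Code Analysis)
- link: https://youjiali1995.github.io/allocator/jemalloc/
- Very detailed; the version covered appears to be 5.0.1.
JeMalloc
- link: https://zhuanlan.zhihu.com/p/48957114
- Covers version 5.1.0.

Purging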

Tick Tock, malloc Needs a Clock
- link:
- A talk on purging by JeMalloc's author, Jason Evans.
Implement decay-based unused dirty page purging
- link: https://github.com/jemalloc/jemalloc/issues/325
- The design thinking behind decay-based purging.
Implement two-phase purging
- link: https://github.com/jemalloc/jemalloc/issues/521
jemalloc purge 改进 (jemalloc purge improvements)
- link: https://youjiali1995.github.io/allocator/jemalloc-purge/

Debugging

使用 GDB 查看 JeMalloc 内存布局 (Inspecting JeMalloc's memory layout with GDB)
- link: https://blog.csdn.net/hl09083253cy/article/details/79147625
jemalloc heap exploitation framework
- link: https://github.com/CENSUS/shadow

Share HashMap for Different Processes by `mmap` in Ruby

yfractal — Sun, 28 Aug 2022 19:21:41 +0800

Why

A cache in the application can reduce both latency and network usage. Such a cache can replace Redis in some situations too. Erlang's ETS is a really good example.

But as we know, most Rails applications are deployed in cluster mode with 3 or more processes, and data held by one process can't normally be accessed by another.

If each process keeps its own cache, memory is wasted and the cache hit rate drops.

So we need a HashMap that can be accessed by different processes.
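
To see the problem concretely, here is a small sketch using an ordinary Ruby Hash on a Unix-like system. The pipe is only there so the child can report what it sees; the variable names are illustrative.

```ruby
cache = {}

reader, writer = IO.pipe
pid = Process.fork do
  reader.close
  cache[1] = 10                       # insert in the child process
  writer.puts cache.keys.inspect      # report the keys the child can see
  writer.close
end

writer.close
cache[2] = 20                         # insert in the parent process
Process.wait(pid)

child_view = reader.read.strip
puts "child sees keys:  #{child_view}"
puts "parent sees keys: #{cache.keys.inspect}"
```

After the fork each process works on its own copy of the Hash, so neither side ever sees the other's insert.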

We need something that works like:

_pid = Process.fork do
  insert(1, 10) # insert in child process
end

insert(2, 20) # insert in parent process
Process.wait

display # should display the 2 elements

How it works

As we know, a program uses virtual memory to access physical memory, and the mapping from virtual memory to physical memory is managed by the operating system.

So we can use mmap to map the same physical memory to different processes.

The code is simple:

#include <sys/mman.h>

void* create_shared_memory(size_t size) {
    int protection = PROT_READ | PROT_WRITE;
    int visibility = MAP_SHARED | MAP_ANONYMOUS;
    return mmap(NULL, size, protection, visibility, -1, 0);
}

Then we need to allocate shared memory for both the array of pointers (the HashMap is really an array) and the array data:

struct DataItem **hashArray = (struct DataItem**)create_shared_memory(sizeof(void *) * SIZE);
void *dataArea = create_shared_memory(sizeof(struct DataItem) * SIZE);

Then we can insert an item into the array:

struct DataItem *item = (struct DataItem *)((char *)dataArea + sizeof(struct DataItem) * hashIndex); // use the shared memory; cast to char * for portable byte arithmetic
hashArray[hashIndex] = item;
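
The slot layout can be modeled in plain Ruby to show why every item lives in a pre-allocated area. This is a sketch under assumptions: SlotTable is a hypothetical class, keys are integers hashed with key % SIZE, and collisions are handled by linear probing, which may differ from what the extension does.

```ruby
class SlotTable
  SIZE = 20                                  # hypothetical table size
  Item = Struct.new(:key, :value)

  def initialize
    # Stand-in for the shared data area: all slots exist up front, so an
    # insert never allocates memory that another process could not see.
    @data_area = Array.new(SIZE) { Item.new(nil, nil) }
    @hash_array = Array.new(SIZE)            # slot "pointers", nil when empty
  end

  def insert(key, value)
    index = probe(key)
    raise "table full" if index.nil?
    item = @data_area[index]                 # reuse the pre-allocated slot
    item.key = key
    item.value = value
    @hash_array[index] = item
  end

  def lookup(key)
    index = probe(key)
    slot = index && @hash_array[index]
    slot && slot.value
  end

  private

  # Index of the slot holding key, or of the first empty slot on its probe
  # path; nil if every slot is taken by other keys.
  def probe(key)
    index = key % SIZE
    SIZE.times do
      slot = @hash_array[index]
      return index if slot.nil? || slot.key == key
      index = (index + 1) % SIZE
    end
    nil
  end
end

table = SlotTable.new
table.insert(1, 10)
table.insert(21, 210)                        # 21 % 20 == 1, so it probes to the next slot
puts table.lookup(1)
puts table.lookup(21)
```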

Writing a C extension for Ruby is easy:

void Init_extension(void) {
    VALUE CFromRubyExample = rb_define_module("CacheRb");
    VALUE NativeHelpers = rb_define_class_under(CFromRubyExample, "NativeHelpers", rb_cObject);
    rb_define_singleton_method(NativeHelpers, "insert", rb_insert, 2);
    // ......
}

Now we can test this:

CacheRb::NativeHelpers.init

_pid = Process.fork do
  CacheRb::NativeHelpers.insert(1, 10)
end

CacheRb::NativeHelpers.insert(2, 20)
Process.wait

CacheRb::NativeHelpers.display

After display is executed, we can see the hash has two elements, 1 => 10 and 2 => 20: one inserted by the child process and one inserted by the parent process.

All the code is at https://github.com/yfractal/cache_rb. You can compile it with rake compile and run CacheRb.demo in the bundle console.

What's Next?

This is just a simple, even silly, example to prove the idea works.

To make it useful, I will find or write a good hash map and handle memory allocation and freeing properly in the coming days.

MIT 6.824 Distributed Systems Reading Notes

yfractal — Wed, 24 Aug 2022 13:58:36 +0800

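Reading Notes

I studied this Distributed Systems course on and off for a long time, and I have finally finished it.

Unlike other courses, where I read the material and did the projects, this time I read the material and took notes. One reason is that I wasn't especially interested in the projects; the bigger reason is laziness... I'll finish the projects when I get the chance.

The notes are as follows:

Miscellaneous

Why

I've long been interested in concurrency, and my work also calls for a better understanding of architecture.

At my previous job, the backend architecture went through several rounds of improvement over a few years: from an initial design that was reasonably sound and could meet moderate performance requirements, to partial refactoring in the middle period, to an overall rewrite later on to meet higher performance and scalability requirements.

These were all architectural changes, and watching different people build the same project in different ways was a lot of fun.

This MIT course mainly walks through papers, from MapReduce to Spanner, and on to Bitcoin and Blockstack.

Concrete things are more interesting than abstract ones. DDIA, for example, explains the principles behind data-intensive applications very well, but they are not easy to remember.

Papers are much more fun: each one has a concrete problem, and you can compare it side by side with similar problems.

I really like the lecturer of this course, who always manages to describe the essence of a problem in the simplest language.

Takeaways

The biggest takeaway is a broader perspective. After this course, I have at least a rough picture of how backends are built, so I won't be at a loss for words when shooting the breeze.

Distributed systems are fun too: there is no perfect solution, but you can make trade-offs, for example among performance, availability, and consistency.

You can see how software evolves, for example from the original MapReduce to the later Spark.

You can see different approaches, such as the Facebook Memcache cluster, which gave consistency a lower priority from the start; Spanner, which chose strong consistency; and COPS, which chose causal consistency.

Backend work is fun: whether it is engineering problems or pure technology and theory, interesting things keep turning up.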
