Strawberry Fields

Learning in the Age of Large Models

2026-05-31T01:25:23+00:00

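Since ChatGPT burst onto the scene in the second half of 2022, we have gone from curiosity and anticipation to a battle among tech giants and a productivity tool in everyone's hands. LLMs and VLMs keep resetting our expectations and lifting more and more tasks off our cognitive load. While we keep a close eye on what AI might do to the relations of production, a bigger question has quietly surfaced: in the age of large models, when the answer to almost every question is within such easy reach, how should we learn? What form should education take for the next generation? Since I became a parent, the question has only grown sharper. Below I run this thought experiment as an imagined conversation: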

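One day in 2040, he asks me, fuming: what is the point of learning all this dry algebra and these formulas, memorizing these stale phrases and essays, grinding through this obscure assembly code and these algorithms? Exams shouldn't even exist anymore; all we need is to learn how to use AI! After venting, he solemnly starts preaching his so-called happy path:

“Using AI for homework goes without saying. For exams, as long as I can cheat with AI I can muddle through, and I won't be exposed at work later either: I'll just solve everything with AI. Tokens are so cheap now, and my regular bike workouts even generate some electricity, so I can cover the cost myself...”

Something suddenly occurs to me, and I cut him off:

“Fine. Suppose everything goes smoothly and you bluff your way through. You are also really talented with AI, and your prompts are simply better than everyone else's, so you become a well-known figure in your field. Now you are invited to give talks, attend panels, and debate other people, in a setting where cheating in real time is completely impossible. How do you handle that?”

“I'd use AI to prepare the script and get familiar with the topic, of course. Isn't being well prepared enough?”

“And how do you prepare? By reading the answers in advance? Even if the organizers brief you on the background, you can't control exactly what challenges the other speakers or the audience will throw at you.”

“Haha, then I'll have it prepare more material, and I'll memorize it all and get familiar with it.”

I cut him off as if I had grabbed a lifeline:

“Haha, you've walked into my trap. Think about it: if you can learn the material AI prepared for you well enough to answer anyone fluently on the spot, how is that different from having mastered it through your own study? And besides, can you really get there by rote-memorizing what AI prepared? If that is where your plan ends up anyway, why not start learning now, so it's easier when the time comes?”

Xiao Niangao is stumped by my answer and can't say a word.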

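This thought experiment gets at the heart of the matter: what learning actually is, and where the boundary of AI assistance lies. As someone who uses AI heavily in daily life, I can feel clearly that the bottleneck in getting things done is my own brain, not the AI tool. When a tool's capability exceeds what your brain can process, the tool sidelines you. For an individual user, this means the tool turns from your assistant into your master, leading you into a world you don't understand at all. Picture Li Hongzhang's predicament: the foreigners say this and that, demand such and such, so in the end this has to be signed; but I truly don't understand what they are saying! Fine, signed, next page. For a team, it means the team has switched to communicating at arm's length: each teammate's AI starts talking directly to the others', and the humans become messengers. What did your MR change? The AI wrote a document. OK, let me take a look, but I can't follow it, so I'll have my AI read it too, and here are its suggestions. Of course, if you believe future AI will be strong enough to trust blindly, you can skip this part.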

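What is the key point? The brain is the bottleneck. Now that we know where the bottleneck is, we need to remove it. This is the first angle for answering how to learn in the age of large models: we need to de-bottleneck our own brains, that is, raise our thinking efficiency, processing speed, and cognitive capacity. You have probably noticed that with familiar things, such as a song you can sing, a text you once memorized, or a subject you are good at, your brain cruises along, takes in similar information without friction, and even feels a bit of pleasure. With unfamiliar things and cross-domain exchanges, such as hearing Post Punk for the first time or reading a paper outside your field, your brain may start to shut down within minutes. The symptoms: what the hell is this, it sounds awful; I know every word on the page, but what do they say together? Under such high cognitive load the brain's processing speed drops noticeably. Even if you tough it out through one article, the next one may make your head feel ready to explode, and concentrating becomes hard. This is where the performance of the brain's processor shows. In people who regularly think deeply, the anterior cingulate cortex has been trained by that thinking and cognitive resistance, so under full load they keep running efficiently with remarkable willpower and outlast those whose attention has already begun to drift. This can also be seen as one form of “stamina”. And this process of deep thinking is what we call “learning”. In other words, learning in the age of large models does not aim at how much you remember, master, or can recite; it is driven by raising your own cognitive capacity. There are two sides to this: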

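The first is building capacity inward: training against the resistance to thinking so that your brain can process more content faster. “Mental gymnastics” such as solid geometry, linear algebra, and physics, combined with subjects like biology, chemistry, languages, and history that require summarizing, connecting ideas, and building mind maps, strengthen this ability on two fronts: absorbing information and building systems of knowledge. Keep in mind that turning the scattered things you have seen into a mind map, and then internalizing it from a drawing on paper into your own system of thought, takes a great deal of energy.

The second is becoming unfazed outward: turning as many things as possible into familiar ones. This has two benefits. First, going from unfamiliar to familiar always involves breaking a boundary, which is also a process of pushing through cognitive resistance. Second, for familiar things, the brain's cognitive load naturally drops.

After training on these two points for a long time, you will find that reading the long stretches of text and plans an agent produces is no longer taxing. For a good while you can read all of it effortlessly and interact with it meaningfully. Put bluntly, to use AI well you have to keep up with AI's level, and the only way to keep up is to learn.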

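That was the first part of this essay: starting from a thought experiment and reasoning backward from the outcome, step by step, to a conclusion. In the second part I will briefly explain, from a technical angle, why AI cannot replace humans, even in text, the domain where it is strongest.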

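Again, start with an experiment. Ask today's most advanced video model to generate a Hollywood-grade clip of an alien invasion and a war for Earth, say two minutes long. It chugs away and delivers: all kinds of cuts, special effects, a breathing camera, a sense of doom, a sci-fi feel, grand visuals, an intense plot. You watch holding your breath, stunned, scalp tingling. Wow, AI is incredible. Now ask it for one more video: on a blank sheet of paper there is a line; at one end of the line is a triangle, which rolls, spinning, to the other end and gradually turns into a regular pentagon along the way. Nothing else, just the simplest line art, the kind of look old Flash animations had.

Suddenly it stops working. The simplest Flash-style animation, a few lines, and however you revise the prompt it just cannot produce the effect. What happened? Nobody has ever seen an alien invasion, so the model can improvise freely, and whatever it stitches together from its vast training data will always look satisfying. But given the simplest yet most specific request, it freezes and cannot improvise. Why? The solution space is too small. For a generative model's diffusion output, a solution space this small, with a single solution, leaves almost nowhere to sample.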

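The same principle carries over to text. If you have worked out in your head exactly what you want to say, then no matter how you explain it to AI, it cannot generate that exact sentence. The only workable approach is to give AI your full chain of reasoning as context; even then there is no guarantee it produces the same wording, and why bother when writing it yourself is easier? It is like wanting to shoot a film when you already have a completely concrete picture in your head, down to where each prop goes. However you describe it, AI cannot generate what you imagine; you can only give it a photo first and have it follow that. As for the so-called dream generators that look so cool, that is because you cannot remember what the dream actually looked like, so the model is free to improvise again. In other words, if you find AI's output amazing, either you did not know what you wanted in the first place and were happy with whatever came out, or you are doing work that depends heavily on context, which, put simply, is drudge slide-deck work. As for the high-value idea in your head, an exact token combination whose probability is effectively P=0, AI cannot generate it. If it can, your context was simple enough that you could have just told it directly.

So truly great ideas, valuable writing, and creative work are things AI cannot do. All it can do is summarize and, drawing on the massive data it has seen, fill in the space of imagination. That kind of literature-scraping drudgery is exactly where AI can save us energy. What you end up doing with the summaries and the energy you saved is up to you. That is the finishing touch that learning provides.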

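That was the second part: arguing for the boundary of AI's capability from the perspective of solution space. Next, in the third part, I talk about why we need to learn from the perspective of experience.

Since AI arrived, knowledge has become too easy to reach. There is no more waiting and digging; everything is satisfied instantly, and ready-made frameworks and details are spoon-fed to you. Learning seems to have become something very fast; work and life have also become efficiency-first, and if you are not fast, someone else is faster. Whatever you don't understand, a large model instantly lays out everything related to the field, from shallow to deep, in full.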

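But in terms of experience, this is also a downgrade, even though our sensory experience has already been downgraded many times over the past few productivity revolutions. Modern people already find it hard to feel the longing in “海上生明月，天涯共此时” (“The bright moon rises over the sea; from the far ends of the earth we share this moment”), let alone today's “post-contemporary” people, with attention spans under ten seconds in an era dominated by short video. The same goes for the delicate feelings that grow out of food, such as homesickness for water shield and perch (莼鲈之思) or regret over the many bones of the shad (鲥鱼多刺), which have long been drowned in a red sea of chili now that spicy comes first and everything can be grilled. (As an aside, I don't think there is any need to stress the original flavors of Chinese food to Westerners. They grew up on hyper-processed food and all sorts of strongly flavored things; even if their taste buds pick it up, their synapses don't register the natural taste of ingredients like cattail shoots and fava beans, which to them are no different from tasting water.) Large models have simply pushed this downgrade to the extreme and shortcut your brain. People gradually forget what the process of learning and thinking is like, let alone feel its beauty. Learning is itself a process of being led inward step by step, but with this straight-to-the-point approach people lose patience for the winding path to a secluded spot; the reserve of a Jiangnan garden becomes an obstacle, and in a fit of temper you knock down all the walls. This costs people the ability to slowly uncover and appreciate the beauty of knowledge. I remember taking ELEC 241, the introductory electrical engineering course, as an undergraduate: the concepts appeared one after another like stops on a journey, slowly unfolding in front of you, learned thread by thread, and the surprise and charm when they finally came together into a complete picture is something unsystematic study and short, high-dose exposure cannot match or let you feel.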

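From this angle, the real process of learning is this: when others cannot explain something clearly, you help yourself and try to understand it. Having a large model present the answer directly is rote memorization, forcibly memorizing QKV without the process.

These three parts are some of my thoughts on what learning is and how to learn in the age of large models. As technology develops, some of these views may come to look dated; I'll leave this here for now.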

Ray OOM Prevention: Best Practices

2025-12-11T21:24:23+00:00

Background

When running distributed workloads with Ray, we’ve been intermittently hitting worker OOM issues.

Sometimes it’s minor — a few workers restart and end up corrupting Parquet outputs. Other times the entire job crashes. What makes this especially annoying is that it’s not deterministic. You rerun the same job and the OOM might not happen again. It’s very hard to reason about, and as a result we’ve been hesitant to scale jobs to larger sizes.

Fact Check

Worker Killed

There are two main cases: Ray proactively killing workers, and the system killing them underneath Ray. Reference

Raylet’s OOM monitor tracks total memory usage on a node, including worker heap, object store, and Raylet internal usage. Once it exceeds a threshold (95% by default), it will pick a worker and kill it:

[2025-12-05 08:50:23,174 E …] (raylet) node_manager.cc:3069:
1 Workers (tasks / actors) killed due to memory pressure (OOM),
0 Workers crashed due to other reasons at node (ID: …)

There is an important line in the doc:

The memory monitor avoids infinite loops of task retries by ensuring at least one task is able to run for each caller on each node. If it is unable to ensure this, the workload will fail with an OOM error.

This basically means if Ray believes no further progress can be made, it will fail immediately.

Another subtle point: Ray does not retry exceptions raised by application code. So even though the doc says the memory monitor may ignore max_retries and retry remote tasks, in practice if the OOM surfaces as an application-level exception, setting max_retries=0 will still cause an immediate failure:

ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Memory on the node (…) was 45.62GB / 48.00GB (0.950375), which exceeds the memory usage threshold of 0.95. Ray killed this worker because it was the most recently scheduled task.

When deciding which worker to kill, Ray follows a policy:

Prefer killing remote tasks over actors (actors are stateful and harder to recover)
Prefer killing tasks from the caller with the most running tasks (fairness)
Among those, kill the most recently started task (least wasted work)

This avoids infinite retry loops.
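The three rules above can be sketched as a small selection function. This is only an illustration of the ordering described in the text, not Ray's actual implementation; the `Worker` record and its fields are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    caller: str        # owner that submitted the task or actor (hypothetical field)
    is_actor: bool     # actors are stateful, so they are considered last
    start_time: float  # larger means started more recently


def pick_victim(workers):
    """Pick the worker to kill under memory pressure, following the three rules."""
    if not workers:
        return None
    # Rule 1: prefer plain tasks; fall back to actors only if no task is running.
    tasks = [w for w in workers if not w.is_actor]
    pool = tasks if tasks else list(workers)
    # Rule 2: pick the caller with the most running workers in the pool.
    counts = {}
    for w in pool:
        counts[w.caller] = counts.get(w.caller, 0) + 1
    busiest = max(counts, key=counts.get)
    # Rule 3: within that caller, kill the most recently started worker.
    candidates = [w for w in pool if w.caller == busiest]
    return max(candidates, key=lambda w: w.start_time)


workers = [
    Worker("a1", caller="driver", is_actor=True, start_time=9.0),
    Worker("t1", caller="driver", is_actor=False, start_time=1.0),
    Worker("t2", caller="driver", is_actor=False, start_time=5.0),
    Worker("t3", caller="other", is_actor=False, start_time=7.0),
]
victim = pick_victim(workers)
print(victim.worker_id)
```

In this toy input the actor is the newest worker, but it is skipped because tasks exist; among tasks, the caller with two running tasks loses its most recent one.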

In theory retries should make this safe. In practice, it depends on the task. If you’re doing something like writing a Parquet file that cannot resume, killing the worker just means corrupted or lost data. So we still want to avoid getting into this situation.

There are two environment variables that control this behavior:

RAY_memory_usage_threshold (default 0.95)
RAY_memory_monitor_refresh_ms (default 250 ms)

If you set the refresh interval to 0, you effectively disable Ray’s memory monitor.

OS OOM (Kernel Kill)

If Ray’s monitor is disabled, or memory spikes too quickly for it to react, the kernel OOM killer will step in.

You’ll see logs like:

Worker exit type: SYSTEM_ERROR Worker exit detail: The leased worker has unrecoverable failure.

A worker died or was killed while executing a task by an unexpected system error.

Possible causes: (1) SIGKILL by OOM killer due to high memory usage (2) ray stop --force (3) SIGSEGV or other unexpected errors

At this point Ray can’t really help. It just observes that the worker disappeared and may retry. Reference

Remote Memory Option

So how do we reduce the chances of hitting these cases?

Reference

We can use:

process.options(memory=N * 1024 * 1024 * 1024)

to hint how much memory each worker needs.

But this is often misunderstood.

First, this is not a hard limit. It’s just a scheduling hint. If a node has 22GB and each worker requests 4GB, Ray will schedule 5 workers and stop:

Warning: The following resource request cannot be scheduled right now:
{'memory': ..., 'CPU': ...}

Second, Ray does not enforce this at runtime. If a worker exceeds the declared memory, nothing happens immediately. If enough workers exceed their estimates, you still hit the global threshold, and then either Ray or the OS starts killing processes.
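The arithmetic behind these two points can be checked with a toy model. It is a deliberately simplified sketch: it assumes workers are the only memory consumers on the node (no object store, no system overhead), and the function names are made up for illustration.

```python
def scheduled_workers(node_memory_gb, declared_gb):
    """Workers admitted on one node, based only on the declared request."""
    return int(node_memory_gb // declared_gb)


def exceeds_threshold(node_memory_gb, num_workers, actual_gb, threshold=0.95):
    """True if real usage crosses the memory monitor threshold."""
    return num_workers * actual_gb > threshold * node_memory_gb


n = scheduled_workers(22, 4)             # scheduling uses the declared 4 GB
print(n)
print(exceeds_threshold(22, n, 4))       # workers stay within their estimate
print(exceeds_threshold(22, n, 5))       # each worker overshoots by 1 GB
```

With a 22GB node and a 4GB declaration, five workers are admitted; if each one really uses 5GB, total usage is 25GB, which is above 95% of the node.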

Mitigation

The most reliable approach we found is to assume the worst case based on the smallest node.

Instead of thinking “how many workers can I run on average”, think:

On the node with the least memory, how many workers can I safely run?

Let the minimum memory across nodes be $M_{\min}$, and each worker needs $C$. Then we conservatively run:

$K = \lfloor M_{\min} / C \rfloor$ workers per node.

If the job runs on $N$ worker nodes, total workers = $N \cdot K$.

In our setup CPU is not the bottleneck — memory is. Each node has plenty of CPU, so we can back out num_cpus per worker from total CPU / total workers.
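As a sketch, the sizing rule can be written as a small planning helper. The function name, its arguments, and the example cluster are assumptions for illustration; it splits CPUs per node, which matches "total CPU / total workers" when all nodes have the same CPU count.

```python
def plan_workers(node_memory_gb, node_cpus, per_worker_gb):
    """Size a job from the smallest node: K = floor(M_min / C) workers per node."""
    m_min = min(node_memory_gb)
    k = int(m_min // per_worker_gb)
    if k == 0:
        raise ValueError("per-worker memory does not fit on the smallest node")
    total_workers = len(node_memory_gb) * k
    # CPU is assumed plentiful, so split the smallest node's CPUs across its K workers.
    num_cpus = min(node_cpus) / k
    return {"workers_per_node": k, "total_workers": total_workers, "num_cpus": num_cpus}


plan = plan_workers(node_memory_gb=[48, 48, 44], node_cpus=[60, 60, 60], per_worker_gb=8)
print(plan)
```

Here the 44GB node sets the limit, so every node runs five workers even though the 48GB nodes could fit six.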

For the head node, follow Ray’s guidance and don’t schedule tasks on it: Reference

Also be careful with memory usage in the driver script. Especially ray.get. Even if tasks return small values, each result includes metadata and object references. Calling ray.get on a large list can put significant pressure on the head node, especially combined with GCS. Reference

Validation

All of the above is reasoning, so we validated it with a simple stress test.

We used a minimal task that aggressively consumes memory via mmap:

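import mmap
import time

GiB = 1024 ** 3  # bytes per GiB

def consume_data(self, input_data):
    # Map anonymous memory and write every byte so the pages are actually committed.
    num_bytes = int(round(self._test_mem_size_gb * GiB, 0))
    mm = mmap.mmap(-1, num_bytes)
    mm.write(b"\0" * num_bytes)
    time.sleep(0.1)
    return True, {"size": mm.tell() // GiB}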

We ran two jobs. In both cases actual usage was about 5GB per worker.

With memory set to 4GB, Ray scheduled too many workers and the job quickly OOMed.

With memory set to 8GB, giving some buffer, Ray scheduled fewer workers and the job ran successfully. We repeated this multiple times with max_retries=0 to ensure any OOM would immediately fail, and the behavior was consistent.

If we increased the workload, the same pattern held. Slight underestimation still caused OOM. Generous overestimation stabilized the system.

Example

At some point experiments aren’t enough, so we just scaled it up.

Using this approach, we were able to run large-scale CPU jobs that we previously couldn’t stabilize:

• 600 nodes × 60 CPUs, processing ~6.33M data in 108 minutes
• 750 nodes × 60 CPUs, processing ~4M data in 42 minutes

There was one Raylet kill in the second run, but retry recovered it successfully.

Overall, the key takeaway is that Ray’s memory controls are not strict enforcement. Stability comes from conservative planning, especially based on the weakest node, and leaving enough buffer so neither Ray nor the OS is forced into reactive killing.

LLM Study Notes IV: Multimodal Large Language Models

2025-09-22T14:27:43+00:00

VLM (Vision-Language Models)

VLMs exhibit strong capability in both image and text understanding. A good use case is answering questions about an image, which involves both spatial and semantic understanding of the image, together with basic knowledge and reasoning abilities in text. The basic structure takes embeddings from both image and text encoders and performs cross-attention between them.

Pre-Training

This is the contrastive learning part in CLIP, where a massive dataset of image-text pairs is used to align image tokens with text tokens.

Fine-Tuning

The aligned tokens are then cross-attended and trained for specific downstream tasks, such as VQA, image-captioning, image-text retrieval, etc. The dataset used for fine-tuning is usually smaller but requires high quality, and sometimes synthetic data combined with bootstrapping methods are used (like in BLIP).

Vision Transformer (ViT)

It’s basically just a flattened language transformer.

First of all, the model overview. Vision transformer is encoder only:

Vision Transformer Architecture

The whole point is to flatten 2D image patches into a 1D sequence of embeddings. The position embedding is still a learnable 1D embedding, since 2D-aware position embeddings brought no significant improvement. The linear projection of flattened patches is the tokenization of image patches, analogous to the tokenization of words in an LLM, except that an LLM uses a pre-defined token mapping, usually not a neural network.

The classification is similar to that of BERT: a class token is prepended to the flattened embeddings, so we have $z_0^0=\text{[class]}, z_0^1, ..., z_0^T$, and the final encoder output embeddings are $z_L^0=\mathbf{y}, z_L^1,...,z_L^T$. The length of the output embedding sequence depends on the patch and image sizes.
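To make the patch tokenization concrete, here is a minimal pure-Python sketch on a toy single-channel image. The projection matrix and the class token values are made-up numbers (both are learned in a real model), and position embeddings are left out for brevity.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch vectors, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches


def linear_project(vec, weight):
    """Project one flattened patch with a (len(vec) x D) weight matrix."""
    return [sum(v * weight[i][j] for i, v in enumerate(vec))
            for j in range(len(weight[0]))]


image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 image
patches = patchify(image, patch=2)                           # 4 patches of 4 pixels
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]    # hypothetical 4 -> 2 projection
class_token = [0.0, 0.0]                                     # learnable in a real model
sequence = [class_token] + [linear_project(p, weight) for p in patches]
print(len(sequence), patches[0], sequence[1])
```

The sequence length is the number of patches plus one for the class token, which is why it depends only on the image and patch sizes.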

There are two common ways to digest these embeddings:

Use only the single pooled embedding $\mathbf{y}$, which goes through a linear head to produce the final prediction logits for classification. This is the main purpose in the original ViT paper. CLIP uses this $\mathbf{y}$ embedding as the summary token for later alignment with the text embedding. Note that even though only the class token is explicitly used to generate the classification label, the bidirectional attention ensures the other patch embeddings are also well encoded and contain meaningful information about the image.
Use the entire encoder output embeddings. This provides a much richer context with spatial information of the image for downstream task. For example,
1. Masked Autoencoder (MAE) uses random masks on patches and their embeddings, and train the decoder to fill out these parts;
2. DINO uses two ViTs, teacher and student, on same image with different augmentations, to align feature distributions for self-distillation, which is really strong on semantic grouping/clustering;
3. Most VLMs such as BLIP use the full patch embeddings for cross-attention with text embeddings for tasks like caption generation and image QA.

ViT is the cornerstone of all multi-modal models. Depending on the type of downstream task it was trained on, a pre-trained ViT encoder has different properties. Based on the model size and data scale, VLMs can choose to use a ViT with frozen weights, PEFT (usually LoRA), or full-fledged end-to-end training.

CLIP (Contrastive Language-Image Pre-Training)

CLIP Pipeline

The figure in original CLIP paper illustrates its contrastive training nature perfectly. The output embeddings of text encoder and ViT encoder are individually projected into a joint multi-modal embedding space:

\[\begin{align}&\mathbf{z}_L^\text{text} \in\mathbb{R}^{N\times D_t}, \mathbf{z}_L^\text{img}\in\mathbb{R}^{N\times D_i}\nonumber\\&W_\text{text}\in\mathbb{R}^{D_t\times D_e}, W_\text{img}\in\mathbb{R}^{D_i\times D_e}\nonumber\\&e_\text{text}=\mathbf{z}_L^\text{text}W_\text{text}, e_\text{img}=\mathbf{z}_L^\text{img}W_\text{img}\in\mathbb{R}^{N\times D_e}\end{align}\]

A similarity function is then measured across each image and text embedding pairs, with loss

\[\mathcal{L} = -\frac{1}{N}\sum_i(\log\frac{\exp(\text{sim}(e_\text{text}^i, e_\text{img}^i)/\tau)}{\sum_j\exp(\text{sim}(e_\text{text}^i, e_\text{img}^j)/\tau)}+\log\frac{\exp(\text{sim}(e_\text{img}^i, e_\text{text}^i)/\tau)}{\sum_j\exp(\text{sim}(e_\text{img}^i, e_\text{text}^j)/\tau)})\]

In implementation the similarity function is chosen as cosine similarity, which is a dot product between L2-normalized embeddings, and the loss is often simplified by averaging the image-to-text and text-to-image cross-entropy losses:

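import torch
import torch.nn.functional as F

z_text = text_encoder(input_text)  # [N, D_t]
z_img = image_encoder(input_img)   # [N, D_i]

# Project into the same space and L2-normalize, so dot products are cosine similarities
embeddings_t = F.normalize(torch.matmul(z_text, W_text), p=2, dim=1)  # [N, D_e]
embeddings_i = F.normalize(torch.matmul(z_img, W_img), p=2, dim=1)    # [N, D_e]

# Row i holds the similarities of text i to every image, scaled by temperature tau
logits = torch.matmul(embeddings_t, embeddings_i.T) / tau  # [N, N]
# Matching pairs sit on the diagonal, so the target class of row i is i
labels = torch.arange(logits.shape[0], device=logits.device)
loss_t = F.cross_entropy(logits, labels)    # text-to-image: softmax over each row
loss_i = F.cross_entropy(logits.T, labels)  # image-to-text: softmax over each column
total_loss = (loss_t + loss_i) * 0.5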

This is contrastive because there are no explicit labels; the loss is generated by comparing the relative match/similarity between embeddings. At inference time the inputs go through the pre-trained encoders and the embeddings are compared in the same fashion, where the text with the highest probability is selected. In implementation the given set of texts is encoded once and cached; otherwise the computation would be unreasonable.
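The inference step can be sketched without any model code. The embeddings below are made-up 2D vectors standing in for projected encoder outputs, and `tau=0.07` is only an assumed temperature value.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_emb, cached_text_embs, tau=0.07):
    """Return (best_index, probabilities) for one image against cached text embeddings."""
    img = l2_normalize(image_emb)
    logits = [sum(a * b for a, b in zip(img, t)) / tau for t in cached_text_embs]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical text embeddings: normalized once, then reused for every image.
text_embs = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
best, probs = classify([0.9, 0.1], text_embs)
print(best, [round(p, 3) for p in probs])
```

Only the image side is recomputed per query; the text side is a lookup into the cached list.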

BLIP (Bootstrapping Language-Image Pre-Training)

BLIP Pipeline Structure

In order to achieve both understanding and generation capabilities, BLIP trains three modules together for the unified model:

a unimodal encoder that separately encodes image and text, where the text has a [CLS] token to summarize its content like in BERT, which is then used to compute an image-text contrastive loss against the image embeddings. This achieves basic alignment between text and image feature spaces.
an image-grounded text encoder that cross-attends text embeddings with image embeddings, with a linear head that performs binary classification of whether the image-text pair matches. This achieves more fine-grained alignment between vision and language.
an image-grounded text decoder that cross-attends text embeddings with image embeddings. It trains the decoder autoregressively, the same way GPT does, by maximizing the log likelihood of the token sequence. This gives the model the ability to generate text from an image.

Another highlight in BLIP is its bootstrapping method for populating the training dataset. Since high-quality labeled image-text data are expensive, it uses the pre-trained model on higher-quality dataset to filter out wrong image-text pairs from noisy web data, as well as to re-generate captions to replace incorrect image descriptions crawled from web. The purified synthetic data are then gathered for further training, closing the loop.

VLMs are good at specialized downstream tasks, but in order to achieve an AGI-like, general-purpose assistant with strong reasoning and output across all modalities, we need to close the last gap with multimodal LLMs.

VLA (Vision-Language-Action Models)

VLA is a type of model that directly outputs the action modality, widely adopted in robotics and autonomous driving. On top of the perception and reasoning ability of a VLM, it also learns how to act, mostly for robotics and embodied agents. Similar to humans, VLA agents can interact with the physical world, thus actively modifying their own perception state. A VLA generally uses a VLM backbone with an action head, and is fine-tuned on high-quality instruction-action pairs.

RT-2

RT-2 Pipeline

The robotic-transformer paper from Google uses a simplified variation of the standard process. It directly represents robot actions as text strings in a fixed format, so they are still part of the language output, and fine-tuning any VLM becomes simple.
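A rough sketch of what "actions as text strings" can look like: each continuous action dimension is discretized into a bin index and the indices are joined into a string the language model can emit. The value range, the 256 bins, and the space-separated format are assumptions chosen for illustration, not a statement of the exact RT-2 format.

```python
def action_to_text(action, low=-1.0, high=1.0, bins=256):
    """Discretize each continuous action dimension into a bin index and join as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        index = int(round((clipped - low) / (high - low) * (bins - 1)))
        tokens.append(str(index))
    return " ".join(tokens)


def text_to_action(text, low=-1.0, high=1.0, bins=256):
    """Parse bin indices back into approximate continuous values."""
    return [low + int(tok) / (bins - 1) * (high - low) for tok in text.split()]


action = [0.3, -1.0, 1.0, 0.5]
text = action_to_text(action)
decoded = text_to_action(text)
print(text)
print(decoded)
```

The round trip loses at most half a bin of precision per dimension, which is the price of turning control into a next-token prediction problem.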

$\pi_0$ Flow Model

pi0 Model Structure

The $\pi_0$ model from Physical Intelligence takes another direction: it uses a flow-matching (diffusion-style) method for action generation, and the same team devised a new action tokenization method (FAST).

MLLM (Multimodal Large Language Models)

MLLM systems are trained such that all modalities (vision, audio, text, etc) share the same representational token space. They are modality-agnostic transformers. They use the same principle as VLMs, but generalize it to a broader scope, with video, audio, and image projectors that convert all signals into tokens that LLM can process just like words.

Alignment

This is the extra phase compared to LLM training, because new modalities need to be projected into the text space. The alignment phase trains the projectors of these other modalities on large-scale modality-text pairs. The pre-trained LLM itself is frozen, because the goal is to align embeddings, not to teach the LLM new facts.

Instruction Tuning

The LLM weights are then unfrozen and SFT starts, to unlock the reasoning ability. The datasets are usually instruction prompts that ask questions about the multimodal inputs, and the LLM now learns to use its foundational knowledge to reason about those inputs.

RLHF

Similar to LLMs, RLHF extends to multimodal inputs by using a preference model or human preference labels with algorithms such as PPO or DPO.

Beyond Language-Backboned Models

Be it VLM or MLLM, up to now all modalities are projected and unified into the word embedding space, where the backbone is a pre-trained LLM that stores foundational knowledge in the language domain. This works well in many cases, but as humans we all understand the subtle linguistic bias of language; a text description of the world is a lossy compression, hence the “lost in translation” between languages. Even for a simple image description task, describing every single detail is extremely inefficient. Moreover, humans and even animals can perform intuitive spatial reasoning and physics predictions without language. What if we instead train a model where knowledge is directly stored in “visual tokens”? This might be one step closer to the ultimate “universal token” that underlies the nature of everything. We can call it the world model.

Storing knowledge in visual tokens is challenging. The unstructured, continuous nature of vision and the curse of its high dimensionality make tokenizing it the way we tokenize text extremely hard. There are, however, many works in the specialized realm of robotics that have already made good attempts: imagined rollouts that predict the next frame from the current frame and action are already used in some cutting-edge research projects, for example at 1X. This new type of simulation forces the model to internalize a fundamental understanding of physics, such as gravity and object manipulation. This type of vision-based model, however, has obvious shortcomings: 1. vision captures physical properties such as affordances and dynamics, but only as physical appearances, lacking the tactile dimension; it is also unable to capture abstract concepts, which are surprisingly well captured by language as a compressed form of knowledge; 2. it requires an insane amount of video to train a comprehensive, generalized model that “understands” the dynamics across a variety of scenarios.

NVIDIA Cosmos

Cosmos is a large-scale video-first world foundation model that shifts the backbone of multimodal learning from language to vision and dynamics. Instead of aligning everything into text embedding space, Cosmos learns directly from massive video corpora, storing its foundational knowledge in visual/world tokens.

The training pipeline has three main stages:

Video tokenization. Raw videos are first compressed into discrete or continuous tokens using specialized video tokenizers (encoder–decoder style, similar to VQ-VAE or neural codecs). This reduces the high-dimensional video space into compact, learnable tokens while retaining spatial–temporal structure.
World model pre-training. Two model families are trained on hundreds of millions of video clips:
- Diffusion-based WFMs, which de-noise and predict continuous video tokens.
- Autoregressive WFMs, which learn to predict future discrete tokens step by step.
Both approaches push the model to internalize physical dynamics, temporal reasoning, and long-horizon structure beyond static recognition.
Post-training adaptation. The pretrained WFM is fine-tuned for downstream embodied AI domains:
- Autonomous driving, where the model predicts future traffic scenes from onboard video.
- Robotics and manipulation, where it learns affordances and physical causality.
- 3D navigation and camera control, where the model anticipates environmental changes to guide actions.

To bridge this visual world model with human interaction and higher-level reasoning, Cosmos integrates a language head. World tokens from the video backbone are projected and aligned with text tokens, enabling tasks like instruction following, dialogue, and abstract reasoning. Instruction tuning and RLHF further refine this alignment, so the system can use language as an interface while keeping its foundational knowledge rooted in vision and dynamics.

The key contribution of Cosmos is reframing foundational knowledge as world understanding rather than linguistic reasoning. By learning directly in the video domain, Cosmos aims to capture intuitive physics and causal dynamics that are difficult to compress into language descriptions.

To close the gap between abstract knowledge and symbol grounding, current research usually takes a mixed approach that combines the advantages of both worlds: language for reasoning and factual knowledge, vision for high-dimensional frame prediction and scene understanding. We are on our way to expanding vision knowledge beyond a confined dynamic (e.g., autonomous driving scenarios) and action space (e.g., state space models for robot action planning) toward a more generalized, knowledgeable, unified MLLM.

LLM Study Notes III: Post-Training

2025-09-12T22:03:30+00:00

SFT (Supervised Finetuning)

This is the most intuitive first step after getting a pre-trained model that can auto-regressively generate tokens that make sense. The model can now form full sentences, continue a passage, or “understand the meaning” of questions, but it still needs guidance to behave “normally” in human eyes. We can push the model to generate what we want by feeding it such data, so that it is not blindly emitting words but generating text the way we expect, hence the “supervision”. Take the simplest example of QA pairs: the prompt (prefix for the decoder) is a question, and the expected prediction sequence is the answer. We can compute the loss between the model's actual answer and the ground-truth answer, and update the model.

Just like in pre-training, it uses teacher forcing, so it basically feeds in the selected [Q + A] sequence. This is the same format used in pre-training, just with some special tokens separating the question and the target answer. The inputs look like

\[\text{[BOS]}, q_0, q_1, ..., q_{n-1}, \text{[SEP]}, a_0, a_1, ..., a_{m-1}\]

where the model outputs

\[\hat{q}_0, \hat{q}_1, ..., \hat{q}_{n-1}, \text{[SEP]}, \hat{a}_0, \hat{a}_1, ..., \hat{a}_{m-1}, \text{[EOS]}\]

The only difference is we mask off the question part, and only compute cross entropy loss between logits of generated answer tokens and ground truth answer tokens.
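A minimal sketch of the masking, assuming we already have the per-position log-probability the model assigns to each target token under teacher forcing. The numbers are made up; only the masking logic matters.

```python
import math


def masked_sft_loss(token_logprobs, answer_mask):
    """Mean negative log-likelihood over answer positions only.

    token_logprobs[i] is log p(target_i | prefix) under teacher forcing;
    answer_mask[i] is 1 for answer tokens and 0 for question tokens.
    """
    picked = [-lp for lp, m in zip(token_logprobs, answer_mask) if m]
    if not picked:
        raise ValueError("mask selects no answer tokens")
    return sum(picked) / len(picked)


# Toy sequence: 3 question targets (masked out) followed by 2 answer targets.
logprobs = [math.log(0.1), math.log(0.2), math.log(0.3), math.log(0.5), math.log(0.25)]
mask = [0, 0, 0, 1, 1]
loss = masked_sft_loss(logprobs, mask)
print(round(loss, 4))
```

The poorly predicted question tokens do not contribute at all; the loss is the average of the two answer terms only.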

For proper full post-training SFT, the entire model is fine-tuned end-to-end in this fashion. For specific tasks or small datasets, parameter-efficient fine-tuning (PEFT) techniques such as LoRA and prefix-tuning are also used; they freeze the pre-trained weights and train only a small number of added parameters instead of updating the entire model.

What’s the scale of the SFT dataset, and how many labelers/crawled data from internet?

Reward Model

To make the model ready for deployment, simply cramming it with preferred data is not sufficient. There should be at least some metric to score the model's responses. One way to do that is to train a reward model that takes a [prompt, response] sequence and generates a scalar score. Training the reward model involves heavy human labeling, which is the “human feedback” part of RLHF. The labelers' job is not to give scores; it would be too subjective and unintuitive to give absolute scores on a scale. They simply compare two responses the model generated from one prompt and label which one they prefer. This way the ranking of multiple responses can be established. The training signal comes from this A/B comparison:

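import torch
import torch.nn.functional as F

score1 = reward_model(prompt, answer1)  # scalar score of answer 1, given the prompt
score2 = reward_model(prompt, answer2)  # scalar score of answer 2, same prompt
pref = torch.sigmoid(score1 - score2)   # predicted probability that answer 1 is preferred
loss = F.binary_cross_entropy(pref, label.float())  # label is 1 if the labeler chose answer 1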

The reward model is usually as simple as one reward MLP head attached to the output embeddings of the LLM, and in that setup the LLM parameters are frozen while the reward head is trained.

Reinforcement Learning w Human Feedback (RLHF)

The RL part is essential to achieve human-like performance from the model, and it combines the previous SFT and reward model. The RL agent, or the initial policy, starts from the SFT model. The reward model is crucial for giving feedback to the agent for policy updates. There are a few common setups, widely adopted by various fine-tuning methods.

Start from the simple case where one episode is one prompt -> one answer. This is the most common and intuitive approach. Different from regular RL, for LLMs we need to note:

$r_t = 0$ for all steps except the last, where $r_T = \text{RM}(s_T)$. This makes the reward signal extremely sparse.
The response is usually penalized by how much the updated policy deviates from the original (SFT) reference policy. This is reflected in the reward signal, which is commonly set as

\[r = \text{RM}(s_T) - \beta \text{KL}(\pi_\theta \| \pi_{ref})\]

There is no state transition, no external environment dynamics, and the new state is just the next token (action) concatenated with previous states/actions (prefix).
The sequence log-prob $\log\pi_\theta(y\mid x)$ is very useful in policy gradient algorithms. The occurrence probability for a sequence of tokens is $p_\theta(\tau) = p_\theta(x_0)\cdot p_\theta(x_1 \mid x_0)\cdot p_\theta(x_2 \mid x_0, x_1)\cdot ...$, which becomes \(\log p_\theta(\tau) = \log p_\theta(x_0) + \sum_{t=1}^T\log p_\theta(x_t \mid x_{<t})\).
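The points above can be sketched in a few lines. One assumption to flag: the KL penalty is spread over tokens here, using the log-prob difference of each sampled token as a single-sample estimate, which is a common implementation choice rather than something stated in the text. The function name and all numbers are illustrative.

```python
def shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: a KL penalty at every step, plus the RM score on the last token.

    logp_policy[t] and logp_ref[t] are the log-probs of sampled token t under the
    current policy and the frozen SFT reference policy.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # sparse reward: only the final step gets the RM score
    return rewards


logp_policy = [-1.0, -0.5, -2.0]
logp_ref = [-1.2, -0.5, -1.0]
rewards = shaped_rewards(rm_score=2.0, logp_policy=logp_policy, logp_ref=logp_ref)
sequence_logprob = sum(logp_policy)  # log pi_theta(y | x) as a sum of token log-probs
print(rewards, sequence_logprob)
```

Summing the per-token rewards recovers the sequence-level form, the RM score minus beta times the estimated KL.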

Now we can check some popular algorithms used in RL for LLM.

Proximal Policy Optimization (PPO)

This algorithm was devised in 2017 and has since been popular in all branches of RL. It is widely used in fields such as robotics due to its intrinsic training stability. It was adopted by InstructGPT in the early versions of the GPT series, and uses an actor-critic architecture.

The actor is the SFT model that samples actions by policy. The policy update is nothing special, and no different from original PPO algorithm:

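\[\begin{align}r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} &=\exp\left(\log\pi_\theta(a_t\mid a_{<t}) - \log\pi_{\theta_{\text{old}}}(a_t\mid a_{<t})\right)\nonumber\\ L^{\text{CLIP}}(\theta) &= \mathbb{E}_t\left[\min\left(r_t(\theta)A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t\right)\right]\nonumber\end{align}\]

where $r_t(\theta)$ is the probability ratio between the updated and old policies for token $a_t$ given its prefix $a_{<t}$ (the prompt is part of the prefix), and $A_t$ is the advantage.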
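As a small numeric sketch of the original PPO update, the probability ratio can be computed from token log-probs and plugged into the clipped surrogate. The clip range `eps=0.2` and the toy probabilities and advantages are assumptions for illustration.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average clipped surrogate over the tokens of one response (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_theta / pi_theta_old for this token
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Token 1: ratio 1.5 with positive advantage, so the gain is capped at 1.2 * A.
# Token 2: ratio 0.5 with negative advantage, so the clipped term 0.8 * A is the minimum.
obj = ppo_clip_objective(
    logp_new=[math.log(0.6), math.log(0.1)],
    logp_old=[math.log(0.4), math.log(0.2)],
    advantages=[1.0, -1.0],
)
print(round(obj, 4))
```

Working in log-probs and exponentiating the difference avoids dividing two very small probabilities directly.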

Now let’s look at the critic network. This is for me the most challenging part. It basically involves how to compute the crucial $A$ advantage value used for the policy update. We know advantage is how much better the action behaves than the baseline, from the definition:

\[A(s_t, a_t) = Q(s_t, a_t) - V(s_t)\]

$Q(s, a)$ means the expected return from $s$ if $a$ is taken; $V(s)$ means the expected return from $s$ following normal policy; then $A(s, a)$ means the relative improvement when choosing $a$.

Both terms on the RHS are unknown. What can we do about them?

Remember we are training an actor-critic style network. What if we train both $Q$ and $V$ networks? This creates redundancy, because it is equivalent to training one advantage network $A$; but we cannot do that either, because it causes a circular dependency: the actor network (policy) depends on $A$ as well. So our choice is to train either $Q$ or $V$. In the early days of reinforcement learning there was groundbreaking work on training a $Q$ network; Q-learning and DQN are good representative methods. However, to train $Q$ we need to condition on both the state and the action, and the action set here is the entire token vocabulary, which blows up the state-action space. A much simpler and cheaper option is to train the value function network $V$. This V-network becomes our critic.

We have decided to train $V$ as our critic network, then what do we do about $Q$? There are two methods that approximate Q value:

Monte Carlo: use observed return $R_t$ directly.
\[\begin{equation}\hat{Q}(s_t, a_t) = \sum_{k=0}^{T-t}\gamma^{k}r_{t+k}\end{equation}\]
This basically expands the full rollout and uses the return at the end of the entire trajectory, which is equivalent to using the scalar score from the trained reward model given the final state (prompt + full response). With $\gamma = 1$ this assigns the same Q value to every single token in that response, because the reward is only given at the end of the sequence. It is unbiased because it does not use the value estimate, but it has very high variance, since a different action may change the course of the trajectory greatly, causing a very different return at the end.
Temporal Difference (TD-1) target: bootstrapping with value network.
\[\begin{equation}\hat{Q}(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1})\end{equation}\]
This basically relies on the value network to estimate the rest of the trajectory after one transition. This method has low variance because it uses the value network, but has high bias, since the value network can have incorrect estimates and keep giving wrong Q values.
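To make the two estimators concrete, here is a minimal sketch on a toy four-token response. The reward list and the critic outputs in `values` are made-up numbers for illustration, and I assume the value after the last token is 0.

```python
def monte_carlo_target(rewards, t, gamma):
    """Discounted sum of the observed rewards from step t to the end of the rollout."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))


def td_target(rewards, values, t, gamma):
    """One-step bootstrap r_t + gamma * V(s_{t+1}); values has one extra terminal entry."""
    return rewards[t] + gamma * values[t + 1]


rewards = [0.0, 0.0, 0.0, 1.0]       # reward only at the final token
values = [0.5, 0.6, 0.7, 0.9, 0.0]   # hypothetical critic outputs, V(terminal) = 0
gamma = 1.0

mc = [monte_carlo_target(rewards, t, gamma) for t in range(len(rewards))]
td = [td_target(rewards, values, t, gamma) for t in range(len(rewards))]
```

Every token gets the same Monte Carlo target here, while the TD targets simply echo whatever the critic believes about the next state, right or wrong.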

Is there a way to mitigate the disadvantages of those two extreme estimation methods while keeping their advantages? We can see Monte Carlo is picking up returns from all steps, while TD only picks current step. The formulation that mathematically describes the range between those two extremes is $k$-step return:

\[\begin{equation}R_t^{(k)} = \sum_{i=0}^{k-1}\gamma^ir_{t+i} + \gamma^kV(s_{t+k})\end{equation}\]

which estimates the return by rolling forward to step $t + k$ and bootstrapping from there. When $k$ goes from 1 to infinity, it goes from TD-1 to Monte Carlo. To get a good estimate out of all these returns, one natural approach is to assign a coefficient to each $k$-step return:

\[\bar{R}_t = \sum_{k=1}^\infty c_kR_t^{(k)}\]

In his 1988 paper Learning to Predict by the Methods of Temporal Differences, Sutton made a brilliant choice of these coefficients: a geometric distribution, obtained by setting $c_k = (1-\lambda)\lambda^{k-1}$:

\[\hat{R}_t^{\text{TD}(\lambda)} = (1-\lambda)\sum_{k=1}^\infty \lambda^{k-1}R_t^{(k)}\]

This way all coefficients add to one. When $\lambda\rightarrow0$, all the weight sits on $R_t^{(1)}$ and it evaluates to TD-1; when $\lambda\rightarrow1$, the weight shifts to the longest returns and it evaluates to Monte Carlo. This new mixture estimation of return value is called TD($\lambda$).
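The two limits are easy to check numerically. Below is a small sketch of the $k$-step return and the $\lambda$-return for a finite episode; the function names and the toy numbers are mine. For a rollout that ends after $n$ more steps, every $k \geq n$ gives the same Monte Carlo return, so I lump their weights together as $\lambda^{n-1}$.

```python
def k_step_return(rewards, values, t, k, gamma):
    """Sum k discounted rewards, then bootstrap from V(s_{t+k})."""
    k = min(k, len(rewards) - t)  # past the end of the episode this is the Monte Carlo return
    ret = sum(gamma ** i * rewards[t + i] for i in range(k))
    return ret + gamma ** k * values[t + k]


def lambda_return(rewards, values, t, gamma, lam):
    """Geometric mixture of k-step returns for a finite episode."""
    n = len(rewards) - t  # number of remaining steps
    total = 0.0
    for k in range(1, n):
        total += (1 - lam) * lam ** (k - 1) * k_step_return(rewards, values, t, k, gamma)
    # All k >= n share the Monte Carlo return; their weights sum to lam ** (n - 1)
    total += lam ** (n - 1) * k_step_return(rewards, values, t, n, gamma)
    return total


rewards = [0.0, 0.0, 0.0, 1.0]       # reward only at the final token
values = [0.5, 0.6, 0.7, 0.9, 0.0]   # hypothetical critic outputs, V(terminal) = 0

td1 = lambda_return(rewards, values, 0, 1.0, 0.0)    # lambda = 0: pure TD-1
mc = lambda_return(rewards, values, 0, 1.0, 1.0)     # lambda = 1: pure Monte Carlo
mixed = lambda_return(rewards, values, 0, 1.0, 0.5)  # something in between
```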

How does this connect to our advantage function? Let’s start again from TD-1 and plug in the estimation into advantage definition directly, to see what we have:

\[\delta_t^V = [r_t + \gamma V(s_{t+1})] - V(s_t)\]

This is called the TD-error, which measures how surprised the value model is by the outcome of taking this action. It can also be used as the training signal for the value network, which we will mention later. The same quantity can be written at any later step $t+k$:

\[\begin{equation}\delta_{t+k}^V = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\end{equation}\]

To get the total error over $k$-steps, accumulate $\delta$ by $\gamma$ discount factor:

\[\begin{align}&\delta_t^V + \gamma\delta_{t+1}^V + ... + \gamma^{k-1}\delta_{t+k-1}^V =\nonumber\\ &[r_t + \gamma V(s_{t+1}) - V(s_t)] + \gamma[r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})] + ...=\nonumber\\ &r_t + \gamma r_{t+1} + ... + \gamma^{k-1}r_{t+k-1} + \gamma^kV(s_{t+k}) - V(s_t)\nonumber\end{align}\]

We can see the intermediate $V(s_{t+i})$ terms all cancel out, leaving only the bootstrap term $\gamma^kV(s_{t+k})$ and the baseline $V(s_t)$. This is the $k$-step estimator of the advantage function:

\[\begin{equation}\hat{A}_t^{(k)} = \sum_{l=0}^{k-1}\gamma^l\delta_{t+l}^V = \sum_{l=0}^{k-1}\gamma^lr_{t+l}+\gamma^kV(s_{t+k})-V(s_t)\end{equation}\]

Now we apply the geometrically weighted mixture of $\hat{A}_t^{(1)}, \hat{A}_t^{(2)}, ..., \hat{A}_t^{(k)}$, with $k\rightarrow\infty$, just like how we got TD($\lambda$), which gives the Generalized Advantage Estimator, $\text{GAE}(\gamma,\lambda)$:

\[\begin{align}\hat{A}_t^{\text{GAE}(\gamma,\lambda)} &:= (1-\lambda)(\hat{A}_t^{(1)} + \lambda\hat{A}_t^{(2)} + \lambda^2\hat{A}_t^{(3)} + ...) \nonumber\\&=(1-\lambda)(\delta_t^V + \lambda(\delta_t^V+\gamma\delta_{t+1}^V)+\lambda^2(\delta_t^V+\gamma\delta_{t+1}^V+\gamma^2\delta_{t+2}^V)+...)\nonumber\\&=(1-\lambda)(\delta_t^V(1+\lambda+\lambda^2+...)+\gamma\delta_{t+1}^V(\lambda+\lambda^2+\lambda^3+...)+\nonumber\\ &\gamma^2\delta_{t+2}^V(\lambda^2+\lambda^3+\lambda^4+...) + ...)\nonumber\\&=(1-\lambda)(\delta_t^V(\frac{1}{1-\lambda})+\gamma\delta_{t+1}^V(\frac{\lambda}{1-\lambda})+\gamma^2\delta_{t+2}^V(\frac{\lambda^2}{1-\lambda})+...)\nonumber\\ &=\delta_t^V+\gamma\lambda\delta_{t+1}^V+(\gamma\lambda)^2\delta_{t+2}^V+...\nonumber\\&=\sum_{l=0}^\infty(\gamma\lambda)^l\delta_{t+l}^V\end{align}\]

Note we have $\lambda\in[0, 1)$, hence $1+\lambda+\lambda^2+... = \frac{1}{1-\lambda}$, and so on. This is the $\lambda$-weighted view of GAE, which expands all TD-errors for the derivation. There is another view that uses the geometric distribution of $\hat{A}_t^{(k)}$ directly and swaps the order of the two sums:

\[\begin{align}\hat{A}_t^{\text{GAE}(\gamma,\lambda)}&=(1-\lambda)\sum_{k=1}^\infty\lambda^{k-1}\hat{A}_t^{(k)}\nonumber\\&=(1-\lambda)(\sum_{k=1}^\infty\lambda^{k-1}\sum_{l=0}^{k-1}\gamma^l\delta_{t+l}^V)\nonumber\\&=(1-\lambda)\sum_{l=0}^\infty\frac{(\lambda\gamma)^l\delta_{t+l}^V}{1-\lambda}\nonumber\\&=\sum_{l=0}^\infty(\gamma\lambda)^l\delta_{t+l}^V\end{align}\]
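In code, the final sum is usually not evaluated term by term. Since $\hat{A}_t = \delta_t^V + \gamma\lambda\hat{A}_{t+1}$ follows directly from the closed form, a single backward pass is enough. The sketch below compares the two on made-up numbers; the helper names are mine.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has one extra terminal entry."""
    return [rewards[t] + gamma * values[t + 1] - values[t] for t in range(len(rewards))]


def gae_direct(deltas, t, gamma, lam):
    """Literal evaluation of sum_l (gamma * lam)^l * delta_{t+l}."""
    return sum((gamma * lam) ** l * d for l, d in enumerate(deltas[t:]))


def gae_recursive(deltas, gamma, lam):
    """Same quantity for every t, computed in one backward pass."""
    adv = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


rewards = [0.0, 0.0, 0.0, 1.0]       # reward only at the final token
values = [0.5, 0.6, 0.7, 0.9, 0.0]   # hypothetical critic outputs, V(terminal) = 0
gamma, lam = 0.99, 0.95

deltas = td_errors(rewards, values, gamma)
advantages = gae_recursive(deltas, gamma, lam)
```

Setting `lam=0` collapses the result to the raw TD-errors, and `gamma=lam=1` gives the Monte Carlo return minus $V(s_t)$, matching the two extremes discussed earlier.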

By definition, this advantage is the error between the current estimate $V_\phi$ and the value function target. This gives us the training signal for the critic network:

\[V_t^G = V_\phi(s_t) + \hat{A}_t^{\text{GAE}}\]

This bootstraps the critic towards an accurate value function estimate. The full training loop looks like this:

Roll out the entire sequence under current policy $\pi$
Compute $\hat{A}_t^{\text{GAE}}$ with returns from reward model and current $V_\phi$
Use this advantage value $A_t$ to update both actor (policy) and critic (value network)
Repeat the process until convergence

The value network $V_\phi$ is also a head attached to the decoder embedding outputs, just like the reward model. Over the training loops it is pulled towards the correct values by the signal carried in the advantage values, which use the reward model outputs for returns. Let’s look into the details, keeping in mind the special feature of LLM RL: the reward is only given at the end of the sequence, when $t = T$, and the value after the last token is zero:

\[\begin{align}\hat{A}_t^\text{GAE} &= \sum_{l=0}^\infty(\gamma\lambda)^l\delta_{t+l}^V\nonumber\\&=\sum_{l=0}^{T-t}(\gamma\lambda)^l(r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l}))\nonumber\\&=\sum_{l=0}^{T-t-1}(\gamma\lambda)^l(\gamma V_\phi(s_{t+l+1})-V_\phi(s_{t+l})) + (\gamma\lambda)^{T-t}(r_T-V_\phi(s_T))\end{align}\]
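Here is a minimal sketch of that sparse-reward case, also producing the critic target $V_t^G = V_\phi(s_t) + \hat{A}_t^{\text{GAE}}$ from above. The function name, default hyperparameters, and toy values are my own assumptions, not taken from any particular PPO implementation.

```python
def gae_and_value_targets(final_reward, values, gamma=1.0, lam=0.95):
    """values[t] = V_phi(s_t) for t = 0..T; the only reward arrives at step T."""
    T = len(values) - 1
    adv = [0.0] * (T + 1)
    running = 0.0
    for t in reversed(range(T + 1)):
        reward = final_reward if t == T else 0.0
        next_value = values[t + 1] if t < T else 0.0  # nothing follows the last token
        delta = reward + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    targets = [v + a for v, a in zip(values, adv)]  # regression targets for the critic
    return adv, targets


values = [0.2, 0.4, 0.5]  # hypothetical critic outputs for a three-token response
adv_mc, targets_mc = gae_and_value_targets(1.0, values, gamma=1.0, lam=1.0)
adv_td, targets_td = gae_and_value_targets(1.0, values, gamma=1.0, lam=0.0)
```

With `gamma=lam=1` every token's target is just the final reward, while with `lam=0` each token is only pulled towards the critic's own estimate of the next token, except the last one, which sees the real reward.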

Direct Preference Optimization (DPO)

Remember that PPO needs a separately trained reward model and a computationally expensive RLHF process to align the model with human preference. The point of DPO is to simplify this process and achieve the same effect by training directly on the labelled preference data. It captures the reward model implicitly by treating the preference data as a supervised classification problem, instead of using an explicit reward model + RL. DPO is used for later GPT series, such as GPT-4o. The derivation of the DPO update is quite math-intense and involved, so I will just summarize some key points here. The policy training objective is to maximize

\[\mathcal{L} = \mathbb{E}_{(x,y_w,y_l)\sim D}[\log\sigma(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)}- \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)})]\]

The authors continued to show the link between DPO and reward model. First they defined a normalization partition function

\[Z(x) =\sum_y\pi_{\text{ref}}(y\mid x)\exp(\frac{r_\phi(x, y)}{\beta})\]

where $r_\phi$ is the reward model. The optimal aligned policy model is

\[\pi^*(y \mid x) = \frac{\pi_\text{ref}(y\mid x)\exp(\frac{r_\phi(x,y)}{\beta})}{Z(x)}\]

and

\[r_\phi(x, y) = \beta[\log\frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)}+\log Z(x)]\]

The paper also discusses the Bradley-Terry model (for pairwise data) as well as the more general Plackett-Luce model (for ranking with more than two data points). In summary the training loop looks like this:
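One step worth spelling out (my own filling-in, assuming the usual Bradley-Terry form $p(y_w \succ y_l \mid x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$): only the difference of the two rewards matters, and in that difference the $\log Z(x)$ term cancels, which is why the intractable partition function never shows up in the DPO objective:

\[\begin{align}r_\phi(x, y_w) - r_\phi(x, y_l) &= \beta\log\frac{\pi^*(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} + \beta\log Z(x) - \beta\log\frac{\pi^*(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)} - \beta\log Z(x)\nonumber\\&=\beta\log\frac{\pi^*(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} - \beta\log\frac{\pi^*(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}\nonumber\end{align}\]

Replacing $\pi^*$ with the trainable $\pi_\theta$ gives exactly the argument of $\sigma$ in the objective.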

Given prompt, preferred response, loser response $(x, y_w, y_l)$, compute the sequence log-prob under policy $\log\pi_\theta(y \mid x)$, as well as the sequence log-prob for frozen reference policy.
Compute the DPO score for each response, here for the winner and loser:
\[s_\theta(y \mid x) = \beta(\log\pi_\theta(y \mid x) - \log\pi_\text{ref}(y \mid x))\]
Compute the pairwise log-sigmoid objective, and minimize this loss by back propagation.
\[\mathcal{L}(x, y_w, y_l) = -\log\sigma(s_\theta(y_w \mid x) - s_\theta(y_l \mid x))\]
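The three steps fit in a few lines once the sequence log-probs are available. This is a plain-Python sketch on scalar log-probs, not a training-ready implementation; the function names and the default `beta` are placeholders of mine.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO loss from sequence log-probs of the winner (w) and loser (l)."""
    s_w = beta * (logp_w - ref_logp_w)  # DPO score of the preferred response
    s_l = beta * (logp_l - ref_logp_l)  # DPO score of the rejected response
    return -log_sigmoid(s_w - s_l)


# Policy identical to the reference: both scores are 0, so the loss is log 2
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has moved probability mass towards the winner: the loss drops
loss_after = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```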

Group Relative Policy Optimization (GRPO)

GRPO was devised by DeepSeek and achieves stunning performance with significantly fewer parameters and less complexity, thanks to its omission of the critic network. Just like PPO, GRPO also uses advantage values to update the policy model, but instead of using a separate value model (critic) to estimate GAE over one output sequence, it samples a group of $G$ output sequences $s_i$ for the same prompt $p$ and computes their return values $R_i=\text{RM}(p, s_i)$, which are averaged to get a baseline score $\bar{R} = \frac{1}{G}\sum_{j=1}^GR_j$, used to compute the advantage $A_{i} = R_i - \bar{R}$ for each of them. This comparison is well-illustrated by the figure in the original paper, attached below.

Original Figure in GRPO Paper

There are a few more subtle points not captured by the figure:

The KL divergence penalty is not folded into the GRPO reward, but added directly to the loss; it also uses a different estimator from PPO, one that is guaranteed non-negative.
The advantage $A_{i}$ here is a sequence-level advantage, where the reward is only given at the end of each sequence. In the original paper, DeepSeek calls this outcome supervision RL and normalizes it as $\tilde{R}_i=\frac{R_i - \bar{R}}{\sigma(R)}$; however, the paper views it as insufficient, because the advantage is broadcast to each token, so that all tokens are updated in the same direction. The paper then brings up process supervision RL, which assigns a per-token or per-step reward to overcome this problem. It requires a process reward model to give a reward at each step $j$, and these are normalized in the same way by $\tilde{r}_{i,j} = \frac{r_{i,j} - \bar{R}}{\sigma(R)}$, where the advantage of token $t$ sums the normalized rewards of the steps at or after it, $A_{i, t}=\sum_{j\geq t}\tilde{r}_{i,j}$. The process reward is claimed to come from heuristic functions.
The critic-free model is by nature less stable in training, compared to actor-critic models. GRPO has this weakness, and is very sensitive to the batch size, because less data means a noisier baseline. In practice, with a large batch size and a sufficient amount of data, GRPO can overcome this weakness.

Regarding the second point, one question I had was, why not use a similar GAE method to distribute the sequence-level reward back to each token? The problem is that PPO has a trained critic network that approximates the value function, which is what enables GAE. GRPO avoids this critic network, and the group of outputs does not provide such a per-token signal. The process reward model or function is thus necessary.
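The outcome-supervision advantage described above is simple enough to sketch directly. The function name and the `eps` guard are mine, and I use the population standard deviation here, which is an assumption; check the implementation you rely on for the exact variant.

```python
import statistics


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantage of each sampled response relative to its own group."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize:
        return centered
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [c / (std + eps) for c in centered]


# Hypothetical reward-model scores for G = 4 responses to the same prompt
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Each response's advantage is then broadcast to all of its tokens, which is exactly the limitation that process supervision tries to address.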

Reinforcement Fine-Tuning (RFT)

RFT is claimed to be used by Anthropic Claude. It skips the reward model in a different way than DPO does. It shares the same policy update objective with PPO, and for the value model update, it adds an additional clipping term just like for the policy, to stabilize the critic model training:

\[\mathcal{L_V}(\phi) = \frac{1}{2}\mathbb{E}_{\mathbf{e}\sim\pi_\text{old}}[\max(\|V_\phi(s_t)-\hat{R}_t\|^2, \|\text{clip}(\hat{R}_t-V_\phi(s_t), \hat{A}_t-\epsilon, \hat{A}_t+\epsilon)\|^2)]\]

The policy model in RFT does not roll out data like in regular RL. Instead, it works on human-labeled preference data, the same data that can be used to train a reward model. However, RFT skips the reward model training part: it focuses on Chain-of-Thought and extracts answers from the process, which are used to compute the advantage value for the subsequent training process. It is also a soft blend between PPO and SFT, where PPO is pure RL with reward values from another model, and SFT is equivalent to a binary reward assignment: 0 for a bad response, 1 for a good response. RFT takes a partial reward as middle ground. It is actually very similar to the RLT (reinforcement learning teachers) distillation method, which uses the teacher models as heuristics to evaluate the policy rollouts.

Conclusion

The full post-training workflow contains other alignment and tuning techniques, such as rejection sampling, chain-of-thought, thinking mode fusion, and so on. They are devised to improve specific abilities of the models, such as reasoning, instruction-following, and agentic responding. It is the differing ways these techniques are employed that shape the unique personalities of each model, imprinting them with distinctive technological hallmarks of their developers. This study note does not go into those details. As a conclusion, the paper All Roads Lead to Likelihood proves theoretically that all such fine-tuning methods such as DPO, SFT, and RFT are mathematically equivalent.

Fast Data Processing w Ray

2025-08-26T23:06:28+00:00

When I was working on a framework to process a massive amount of data in a producer–consumer, or map–reduce style, I quickly realized the most important challenge was not writing the computation itself, but how to assign work across all workers. The right assignment can keep the system fast and memory-efficient; the wrong one can create stragglers or even out-of-memory errors.

This article walks through the evolution of workload assignment strategies I experimented with in Ray. Starting from the most straightforward static division, I gradually moved toward more dynamic schemes that better handle imbalance and tail effects. Along the way we will see some code sketches and discuss the trade-offs.

Static Assignment

The first attempt was naturally static assignment: divide the dataset into disjoint partitions, each worker gets one partition, and process it. To prevent any single worker from holding too much data in memory, I also added a limit on how many tasks can be in flight. A simplified version looks like this:

import numpy as np

# Ceiling division: every worker gets at most this many rows
num_rows_to_process = dataset_len // num_workers + (
    0 if dataset_len % num_workers == 0 else 1
)

# Clamp both ends so trailing workers never index past the dataset
start = min(worker_id * num_rows_to_process, dataset_len)
end = min(start + num_rows_to_process, dataset_len)

# Control in-flight tasks to avoid OOM
start_indices = np.arange(start, end, max_inflight_data_tasks)
end_indices = np.minimum(start_indices + max_inflight_data_tasks, end)

# (start, end) pairs, each covering at most max_inflight_data_tasks rows
prepare_list = list(zip(start_indices, end_indices))

This ensures each worker has roughly the same amount of work, and memory usage is bounded. For small workloads, this might be sufficient. But when the workload grows massive, differences in worker speed, network latency, or data skew become significant. The slowest worker dominates the total runtime, and static assignment does not adapt.
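The straggler effect is easy to see with a toy model. The sketch below assumes each worker processes rows at a constant speed (rows per second), which is of course a simplification, and the speeds are made up.

```python
def static_makespan(dataset_len, speeds):
    """Finish time when rows are split evenly up front; speeds are in rows per second."""
    num_workers = len(speeds)
    rows = -(-dataset_len // num_workers)  # ceiling division
    finish_times = []
    for worker_id, speed in enumerate(speeds):
        start = min(worker_id * rows, dataset_len)
        end = min(start + rows, dataset_len)
        finish_times.append((end - start) / speed)
    return max(finish_times)


speeds = [1.0, 1.0, 1.0, 0.5]          # hypothetical: one worker runs at half speed
makespan = static_makespan(400, speeds)
ideal = 400 / sum(speeds)              # lower bound if work could be split perfectly
```

Three workers sit idle for half of the run while the slow one finishes its fixed share, so the total runtime is far from the ideal bound.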

Dynamic Assignment with Decaying Chunk Size

The next step was to make the assignment dynamic: instead of giving each worker a fixed slice, let workers ask for work ranges as they go. The manager keeps track of what’s left, and allocates a chunk on request.

A simple heuristic is to reduce the chunk size as the workload gets closer to the end, so the tail is smoother:

import math
from typing import Optional, Tuple


class WorkManager:
    def __init__(self, workload: int, num_workers: int):
        self.next_start = 0
        self.chunk_size = max(1, workload // num_workers)
        self.workload = workload

    def get_next_range(self) -> Optional[Tuple[int, int]]:
        if self.next_start >= self.workload:
            return None

        remaining = (self.workload - self.next_start) / float(self.workload)
        chunk_size = max(1, math.floor(self.chunk_size * remaining))

        start = self.next_start
        end = min(self.next_start + chunk_size, self.workload)
        self.next_start = end
        return start, end

The idea looks nice, but in practice this creates imbalance: earlier workers receive large chunks, while later ones get much smaller ones. This unevenness does not eliminate the tail problem; instead, it sometimes makes it worse.
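To see the imbalance, it helps to print the chunk sizes this heuristic hands out. The helper below replays the same arithmetic as `get_next_range` in a plain loop; the function name and the workload numbers are just for illustration.

```python
import math


def decaying_chunks(workload, num_workers):
    """Replay the decaying-chunk heuristic and collect the size of every chunk."""
    base = max(1, workload // num_workers)
    sizes, next_start = [], 0
    while next_start < workload:
        remaining = (workload - next_start) / float(workload)
        size = max(1, math.floor(base * remaining))
        end = min(next_start + size, workload)
        sizes.append(end - next_start)
        next_start = end
    return sizes


sizes = decaying_chunks(1000, 4)
```

The first request takes a quarter of the whole workload, the sizes shrink with every later request, and the run ends in a long trail of tiny chunks, each costing a round trip to the manager.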

Dynamic Assignment with Granular Splitting

To fix that, I added another parameter: a split factor. Instead of dividing the workload by just the number of workers, divide it further into smaller chunks. Workers still request chunks dynamically, but the granularity is finer:

from typing import Optional, Tuple


class WorkManager:
    def __init__(self, workload: int, num_workers: int, split_factor: int = 1):
        self.next_start = 0
        self.chunk_size = max(1, workload // (num_workers * split_factor))
        self.workload = workload

    def get_next_range(self) -> Optional[Tuple[int, int]]:
        if self.next_start >= self.workload:
            return None

        start = self.next_start
        end = min(self.next_start + self.chunk_size, self.workload)
        self.next_start = end
        return start, end

This avoids the worst imbalance, because no worker is locked into a huge chunk upfront. Instead, fast workers can request more chunks and help catch up with the slow ones. But the chunk size is still fixed throughout the process, and we can do better.
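A small event simulation shows how much the finer granularity helps. It assumes a pull model where whichever worker frees up first grabs the next chunk, constant per-worker speeds, and no scheduling overhead, all of which are simplifications of mine.

```python
import heapq


def simulate(workload, speeds, chunk_size):
    """Total runtime when workers pull fixed-size chunks as soon as they are free."""
    heap = [(0.0, w) for w in range(len(speeds))]  # (time the worker becomes free, worker id)
    heapq.heapify(heap)
    next_start, makespan = 0, 0.0
    while next_start < workload:
        free_at, w = heapq.heappop(heap)
        end = min(next_start + chunk_size, workload)
        done = free_at + (end - next_start) / speeds[w]
        makespan = max(makespan, done)
        heapq.heappush(heap, (done, w))
        next_start = end
    return makespan


speeds = [1.0, 1.0, 1.0, 0.5]                       # hypothetical: one slow worker
coarse = simulate(400, speeds, chunk_size=400 // 4)  # split_factor = 1
fine = simulate(400, speeds, chunk_size=400 // 40)   # split_factor = 10
```

With one chunk per worker the result is the same as static assignment; with ten chunks per worker the fast workers absorb most of the slow worker's share.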

Dynamic Assignment with Adaptive Chunk Size

Finally, I arrived at an elastic strategy: dynamic workers with dynamic chunk sizes. The idea is to use larger chunks at the beginning for throughput, then reduce the chunk size near the end to smooth out the tail. This combines the best of both worlds—efficiency in the bulk of the work, fairness in the final stage.

Here is a Ray-actor version:

from typing import Optional, Tuple

import ray


@ray.remote(num_cpus=0.01)
class WorkManager:
    def __init__(
        self,
        workload: int,
        num_workers: int,
        split_factor: int = 1,
        min_chunk_size: int = 10,
        max_chunk_size: int = 2000,
        tail_percentage: float = 0.2,
    ):
        self.next_start = 0
        self.split_factor = split_factor
        self.min_chunk_size = min_chunk_size
        self.max_chunk_size = max_chunk_size
        self.workload = workload
        self.tail = tail_percentage
        self.num_workers = num_workers

    def get_next_range(self, worker_id: int) -> Optional[Tuple[int, int]]:
        if self.next_start >= self.workload:
            return None

        remaining = self.workload - self.next_start
        estimated_chunks = self.num_workers * self.split_factor

        if remaining < self.tail * self.workload:
            # Tail: smaller chunks to avoid stragglers
            chunk_size = max(self.min_chunk_size, remaining // self.num_workers)
        else:
            # Main phase: larger chunks for throughput
            chunk_size = max(
                self.min_chunk_size,
                min(self.max_chunk_size, self.workload // estimated_chunks),
            )

        start = self.next_start
        end = min(start + chunk_size, self.workload)
        self.next_start = end
        return start, end

This way, fast workers keep the system moving, while the manager automatically shrinks the chunk size near the tail to avoid idle time. It is flexible, memory-safe, and reduces straggler effects.
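To inspect the resulting schedule without starting a Ray cluster, the chunk-size rule can be replayed as a plain function. This mirrors the logic of `get_next_range` above; the helper names and the example numbers are mine.

```python
def adaptive_chunk_size(remaining, workload, num_workers, split_factor=1,
                        min_chunk_size=10, max_chunk_size=2000, tail_percentage=0.2):
    """Chunk size for the next request, given how many rows are still unassigned."""
    if remaining < tail_percentage * workload:
        # Tail: shrink chunks so the last workers finish close together
        return max(min_chunk_size, remaining // num_workers)
    # Main phase: fixed, larger chunks for throughput
    return max(min_chunk_size, min(max_chunk_size, workload // (num_workers * split_factor)))


def chunk_schedule(workload, num_workers, **kwargs):
    """Sizes of all chunks handed out until the workload is exhausted."""
    sizes, next_start = [], 0
    while next_start < workload:
        size = adaptive_chunk_size(workload - next_start, workload, num_workers, **kwargs)
        end = min(next_start + size, workload)
        sizes.append(end - next_start)
        next_start = end
    return sizes


sizes = chunk_schedule(10_000, num_workers=4, split_factor=5)
```

The schedule starts with uniform chunks of 500 rows and only starts shrinking in the last 20% of the workload, never going below `min_chunk_size` except for the final leftover.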

Balancing workload in Ray is not a trivial problem. A static assignment is simple but brittle; a dynamic one with fixed chunk size improves utilization but can still leave a tail. The most effective approach I found is to make the chunk size adaptive—large in the bulk, small near the end—so that workers are both busy and balanced.

The progression from static to adaptive assignment mirrors a common theme in distributed systems: efficiency requires elasticity. We cannot perfectly predict how each worker will perform, but we can design the system to adapt as the computation unfolds. And that, more than any single heuristic, is what keeps large-scale data processing fast and stable in practice.

Urban Exploration II: San Francisco Bay Area

2025-08-23T14:43:32+00:00

From the naval shipyard that once stood as the Pacific Fleet’s bulwark to the inland quarry carved out of sun-baked valleys, the Bay Area tells its story through ruins. The arc begins with the westward push, when railroads and rocks blasted open the hills. It swells in the years of war, when bunkers multiplied on the coast and cranes rose over the shipyards. It ends in glass towers and the circuitry of software empires, where industry no longer stains the hands but writes itself invisibly into code. The tides of history have come and gone, leaving their wreckage in plain sight—structures abandoned, scattered like bones along the water’s edge. Come with me, then, from north to south, across the forgotten corners of San Francisco Bay.

San Francisco’s Pacific Sentinels

The west edge of San Francisco faces the Pacific, cliffs absorbing the first storms, bunkers and batteries crouched like sentries. They were poured into the sand as the mainland’s defense. Now they sit in silence, dark mouths opening to the wind.

The piers stretch across the waterfront like a timeline. Some dazzle with tourists; others still labor, hauling cargo and goods. But in the southeast, war once held sway. Potrero Point and Hunters Point wore their warehouses like armor. Pier 90 is the chosen arena for competitive graffiti artists, where huge murals defiantly show off the nerve of their creators.

Abandoned warehouse on SF piers; Pier 90 silos

Hunters Point—the largest naval shipyard on the coast—still broods over the shoreline. Its giant crane looms above decaying hangars and barracks, casting long shadows over land once measured for radioactive waste. Inside, a warehouse glimmers with glass latticed in wire. Climb the crane, if you dare, and the city unfurls, restlessly teeming beneath your feet.

The US Naval Radiological Defense Laboratory

From top to bottom: Hunters Point factories; the "glass house", for periscope calibration, with amazing optical effects; the symbolic crane, where teenagers sneak in to climb

Old Tracks and Dark Tunnels

Southward, in Brisbane, the remnants of the old railway lie in weeds. The Bayshore Roundhouse at South San Francisco sits like a discarded shell, a leftover from another line.

The abandoned South San Francisco Bayshore Roundhouse

Further down the peninsula, the ground opens into drainage tunnels—labyrinths stretching for miles. In summer, when the channels run dry, kids slip inside for “sewage surfing.” But in the rains, the tunnels flood without warning, turning play into a trap.

A drainage tunnel underneath Los Altos

Silicon Valley’s Ghosts

In Silicon Valley, land rarely stays idle. Yet even here, ruins remain. The Agnews Developmental Center, the biggest insane asylum in the region, was divided into two halves: one razed and memorialized under Oracle’s vast campus; the other still trapped inside Cisco’s empire, a relic fenced and surrounded by parking lots and schools. On the nearby Lafayette Street, the roads diverge: south into San Jose’s blight, north into Alviso, a marshland town where shacks lean against the tide.

Agnews Mental Hospital

Alviso feels like another country, half drowned, its streets ending in wetlands. Just beyond, the ghost town of Drawbridge sinks into mud, roofs collapsing into the swamp. Amtrak trains rush past, offering only glimpses of a place almost gone.

The forgotten city Alviso

East of here, at the border of Milpitas, the Oak Creek Business Park decays in slow motion. Graffiti blooms across its walls, windows shatter, fences multiply. The silence of its offices is louder than the work that once filled them.

Broken windows of a former community college building

Bridge That Ends Midstream

Going around the bay’s mouth, we enter the territories of Fremont and Newark. The famous Dumbarton Bridge connects this part of the Bay with its pitiful counterpart, East Palo Alto, across the water. Few notice the other bridge: its twin, the railway span, burned by kids in the 1990s and left split, unrepaired, over the turbid water. Nearby, pipes and a control house stand sentinel, their machinery rusting, their functions forgotten.

The pipes and control room

The broken bridge

Oakland: the Apocalyptic City Walk

Oakland unravels in plain sight. Due to its growing safety concerns and rampant gang activities, population drifts away. Buildings empty. Greyhound stations, high schools, service centers—all left behind. The 16th Street Railway Station still towers, a husk under constant surveillance, a cathedral of concrete and silence.

16th Street Railway Station, the façade and interior

The Islands of Abandonment

Across the tunnel, Alameda’s island bears its own ghosts. The navy is gone, the barracks face the City Hall like hollow eyes. The runway is locked, except on the first Sunday of each month, when a flea market swarms the concrete pavement.

The runway; the barracks on Alameda Point

Further west is Alameda Point—where the USS Hornet rests, and hangars stretch into emptiness. Windows are shattered, gates closed. Stories cling to the walls—of chemicals, radiation, and contamination—whether true or rumor, they are enough.

The hangars and USS Hornet

Graffiti Belt to the North

Further north, the rails lead into Emeryville and Berkeley. Factories collapse under waves of graffiti, a mural that crawls from wall to wall until it bleeds into Richmond. There, near the docks, sits a chemical lab. Its doors are boarded, warning signs hang loose, walls stained. Maybe it once made medicine. Its past may lie in pharmaceuticals; its present is simply this: abandoned, facing the bay that outlasts every builder.

Warehouses in Berkeley and Richmond

What the Ruins Tell

The Bay speaks in silences: of shipyards that built might, of tunnels that carried water and now carry only echoes, of campuses and military grounds that gave way to technology. Each broken window and rusted pipe is a sentence in that history. These are not just forgotten spaces—they are prophets of neglect, monuments to transitions we prefer not to name.

曾经的家 (Once a Home)

2025-08-12T19:52:45+00:00

To my wife, and my newborn baby

她软绵绵的肚子曾经是他的家
Her cloud-soft belly, once his nest;

现在的他搬家了留下空荡荡的房
He’s moved out now; the room’s at rest.

我对着房子说话还有阵阵回响
I speak into the house; the echoes roll—

这声音轰隆隆的也能传到他心上
boom, and settle inside his soul.

A Bug in Numpy?

2025-07-08T00:23:30+00:00

When I was working with high-dimensional NumPy array manipulations to process road agent interaction data, I encountered some unexpected behavior: for arrays with more than three dimensions, certain slicing operations seemed to cause axis reordering, resulting in downstream shape mismatch errors. For example:

>>> x = np.random.random((1,24,5,6))

# trying to squeeze first dimension and select 2-6 from last dimension
>>> y = x[0, :, np.arange(5), 2:6]
# but surprisingly it does not return (24, 5, 4)!
>>> y.shape
(5, 24, 4)
>>> x.shape
(1, 24, 5, 6)
# expected shape after transposing the result
>>> y = np.transpose(x[0, :, np.arange(5), 2:6], (1,0,2))
>>> y.shape
(24, 5, 4)

This inadvertent axis flip looks very suspicious, and I would not let it slip without finding a reasonable explanation. As a mature, well-maintained, and widely used library for basic computation, NumPy is unlikely to have made such a simple mistake, so there must be a reason behind this phenomenon, or to put it more professionally, a deliberate design choice. Time for some digging.

As I played around with even higher-dimensional indexing, I found it’s not just the axes that got flipped, but even the expected shape might be different:

>>> x = np.arange(48).reshape(3, 4, 4)  # shape: (3, 4, 4)

# Expecting 3 x 4 x 2
>>> y = x[:, [[0,1],[2,3]], [0,1]]
# but instead, got 3 x 2 x 2
>>> y.shape
(3, 2, 2)

So why does this happen? To investigate further, I tried the following:

>>> w = np.arange(1080).reshape((3,12,6,5))
>>> u = w[0, :, :5, :4]
>>> v = w[0, :, range(5), :4]
>>> q = w[0, :, :5, range(4)]
>>> t = w[0, :, range(5), range(4)]

Guess what the shapes are?

>>> u.shape
(12, 5, 4)
>>> v.shape
(5, 12, 4)
>>> q.shape
(4, 12, 5)
# t was never created: w[0, :, range(5), range(4)] raises IndexError due to (5,), (4,) shape mismatch

Only u returned the expected shape. Both v and q reordered the axes, and t failed due to incompatible broadcasting. Here we can see a common pattern: the way we index the array makes a difference. Whenever I use range or [], the output shapes are not as expected. It seems like the axes indexed this way are always placed at the front of all axes!

With a bit of confirmation from the NumPy documentation, I found the following pattern:

There are two types of indexing: 1. basic indexing, including single element indexing (integer value index), slicing and striding (using start:stop:step syntax), and 2. advanced indexing, which is triggered by indexing with an array, a list, or a range.
When both basic and advanced indexing are used, NumPy groups all advanced indices and broadcasts them together. If the advanced indices are separated by a slice, their broadcast shape is placed at the front of the result. Note that an integer index counts as an advanced index once any other advanced index is present, which is why the leading 0 in w[0, :, range(5), :4] matters.

This corresponds to the shapes of u, v, q. t has a shape mismatch because NumPy cannot broadcast shapes (5,) and (4,) together. So in the case of y = x[:, [[0,1],[2,3]],[0,1]], it works as follows:

the index contains two instances of advanced indexing, [[0,1],[2,3]] with shape (2,2) and [0,1] with shape (2,), which together broadcast to (2,2). Because the two advanced indices sit next to each other, this (2,2) block replaces axes 1 and 2 in place, while the sliced axis 0 stays where it was, resulting in the shape of (3,2,2). But wait, what does broadcasting indices mean? It is much more intuitive than it appears: the indices are stretched to a common shape and then paired element by element into index tuples. In this case, we are combining the two advanced indices to construct one array of index pairs, which for each slice x_ along axis 0 selects [[x_[0, 0], x_[1, 1]], [x_[2, 0], x_[3, 1]]]; the results are stacked along the original axis 0.
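The pairing can be verified by doing the broadcast by hand. This is a small check I wrote for illustration, using `np.broadcast_arrays` to make the stretched index arrays visible.

```python
import numpy as np

x = np.arange(48).reshape(3, 4, 4)
rows = np.array([[0, 1], [2, 3]])   # advanced index for axis 1, shape (2, 2)
cols = np.array([0, 1])             # advanced index for axis 2, shape (2,)

# Stretch both index arrays to the common shape (2, 2)
rows_b, cols_b = np.broadcast_arrays(rows, cols)

y = x[:, rows, cols]

# Pick one (row, col) pair at a time, then stack the picks back together
picks = [x[:, r, c] for r, c in zip(rows_b.ravel(), cols_b.ravel())]
manual = np.stack(picks, axis=1).reshape(3, 2, 2)
```

Here `cols_b` is `[[0, 1], [0, 1]]`, so the four selected positions in each slice are (0, 0), (1, 1), (2, 0), and (3, 1).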

But why? The core motivation is to eliminate ambiguity. Consider an array x with shape (A, B, C, D). Suppose we want to slice it by x[:, ind1, :, ind2], where:

ind1.shape == (M,)  # for axis 1
ind2.shape == (N,)  # for axis 3

Should this return a shape (A, M, C, N)? Seems natural, until:

ind1.shape = (M, 1)
ind2.shape = (1, N)

These now broadcast to shape (M, N). But once broadcasted, the origin of each dimension (which axis it was meant for) is lost. We have a (M, N) block of coordinate pairs, and we can no longer assign one dimension to axis 1 and the other to axis 3. There is no clean rule to do so — especially for non-adjacent axes.

In order to eliminate this ambiguity, NumPy distinguishes two cases and, in the ambiguous one, strictly places the dimensions of the basic (sliced) indices after the broadcast advanced-indexing dimensions:

When the advanced indices are separated by a slice, Ellipsis or newaxis. For example x[arr1, :, arr2]. The dimensions resulting from the advanced indexing operation come first in the result array, and the subspace dimensions after that.

When the advanced indices are all next to each other. For example x[..., arr1, arr2, :] but not x[arr1, :, 1]. The dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array (the latter logic is what makes simple advanced indexing behave just like slicing).
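Both cases can be checked on a small array. The shapes and index values below are arbitrary choices of mine; only the resulting shapes matter.

```python
import numpy as np

x = np.zeros((2, 3, 4, 5))          # (A, B, C, D)
ind1 = np.array([[0], [2]])         # shape (M, 1) with M = 2
ind2 = np.array([[0, 1, 3]])        # shape (1, N) with N = 3

# Case 1: advanced indices separated by a slice -> broadcast dims (M, N) go first
separated = x[:, ind1, :, ind2]

# Case 2: advanced indices next to each other -> (M, N) replaces the indexed axes in place
adjacent = x[:, ind1, ind2, :]

# An integer next to a slice and an array also counts as "separated"
mixed = x[0, :, np.arange(4), :]
```

`separated` has shape (M, N, A, C), `adjacent` has shape (A, M, N, D), and `mixed` puts the length-4 axis first even though only one array index was written explicitly.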

Understanding this rule is critical when working with high-dimensional array manipulations. With this knowledge, we can reshape or transpose the result as needed to get the desired layout.
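For the opening example, there are at least two ways to get the (24, 5, 4) layout. One is to transpose afterwards, the other is to drop the leading axis first so that only a single advanced index remains. This is a sketch of both, checked against each other.

```python
import numpy as np

x = np.random.random((1, 24, 5, 6))
idx = np.arange(5)

# Integer 0 and idx are separated by a slice, so the idx axis jumps to the front
mixed = x[0, :, idx, 2:6]

# Option 1: move the axis back where we wanted it
transposed = mixed.transpose(1, 0, 2)

# Option 2: index in two steps, leaving a single advanced index in the second step
two_step = x[0][:, idx, 2:6]
```

Option 2 avoids the reordering altogether, at the cost of one extra basic-indexing step, which only creates a view.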

Urban Exploration I: What is it, and Why

2025-03-07T22:26:50+00:00

On a normal Tuesday, Oct. 17th, 1989, an earthquake of magnitude 6.9 Mw, originating 19 km below the ground surface, 16 km northeast of Santa Cruz, swept from Loma Prieta Peak all the way to the north of San Francisco Bay, resulting in thousands of casualties. Besides over 4,000 landslides and broken sections of the SF-Oakland Bay Bridge, this disaster also severely damaged the Southern Pacific 16th Street Station in west Oakland, 90 kilometers away from the epicenter. Originally built and opened in the late 19th century, this station had served as the main rail link for points north and east of the Bay Area. As a minor consequence of the earthquake's aftermath, compared to other horrendous ground failures, the railway station continued its operation in an adjacent building, until Aug 21, 1994, when the Coast Starlight and California Zephyr made their last stops. The station was then closed, and the railway tracks were removed during the construction of the I-880 highway, detaching it from the Bay Area rail network. The tech boom in the coming era sent the station into oblivion: forgotten by the locals, succumbing to its desolation.

Today, the station is surrounded by barbed wire and fences, with wood planks sealing all entries and windows, weeds and branches encroaching on every inch of its perimeter. Two camera poles with megaphones attached surveil the front and back sides of the building, warning signs intimidating potential intruders. Next to it are brand new apartments and a flower farm, casting a giant contrast between the two worlds. We were not expecting such a high level of security upon arrival. There were three holes in a section of the fence, but the camera behind it made it nearly impossible to enter the main hall without being detected.

The 16th street railway station, Oakland

This is what urban exploration, or “urbex” for short, is about: the practice of entering vacant, uninhabited, or abandoned sites, for the purpose of exploration and documentation.¹

It is a creative process that requires location searching, information gathering, and plan making, mixed with a sense of defiance of authority, as well as the adrenaline rush of facing the unknown. It is a highly interdisciplinary field, where knowledge of history, social science, geoscience, dendrology, architecture, urban engineering, economics, and even biology is desirable for safe and fruitful explorations. Urban exploration, while hovering on the edge of legality with the risk of trespassing, is not equivalent to vandalism. Unlike tagging (graffiti) or squatting (illegally residing in a site), it focuses on documentation and the experience itself, rather than modifying the state of the sites.

The term “urban exploration” is to some extent a misnomer, because the sites are not necessarily in urban areas. I devised a planar spectrum to classify and illustrate their types by mapping a set of common categories onto the quadrants.

"Bando" sites categories projected

The vast uncertainties during the exploration imbue the activity with a sense of adventure. Sites are usually wrapped in layers of protection, so finding an access point means a security breach. Due to lack of maintenance, movement on the unstable structures is often a seismic gamble. There could also be unexpected encounters: unfriendly, paranoid dwellers lurking in the dark; surprise motion sensors that dispatch police or security guards without warning. This lingering sense of challenge is what drives curious onlookers away from this hobby. For the ones who stay, the tangible existence at the location, spatially and temporally, nevertheless evokes a euphoria, a shivering excitement, akin to entering another world. This sensation is likely triggered by emancipation from the regime of movement in urban settings, which is regulated, constrained, or even monitored by others, intentionally or inadvertently. In a bustling crowd, everything is under scrutiny, and “unsociable” activity is demeaned, with the danger of being tagged as “bizarre” or “erratic”. These conventions are internalized by the unwitting individual so thoroughly that no unexpected expression or behavior ever occurs. Abandoned sites are spaces that exist beyond such conventions. They grow interstitially in the absence of urban regulations, as terrain vague, become rightfully obsolete and unproductive, and manifest themselves as spaces of freedom that are an alternative to the lucrative reality prevailing in the late capitalist city. They are anonymous realities.²

An abandoned warehouse drenched in graffiti, Richmond

Without designated routes and patterns, abandoned sites, with their crumbled ceilings, leaky drainage pipes, hanging beams, and often cracked walls, present themselves as labyrinths that reconfigure the topology to contradict our daily experience; previously impassable paths become accessible, whereas the familiar stairways and signs lead to cul-de-sacs. Physical movement is forced to adapt to a more volatile, improvisatory style, yet the sensual affordances remain heterogeneous, varying particularly with the type of site. The following are three main categories.

An abandoned grain silo in San Francisco

Industrial Ruins.

The old, once vivid industrial regions are always crowned with despondent names: Rust Belt for the US Midwest, Il Triangolo della Morte (Triangle of Death) for Campania in Italy, Pays Noir (Black Country) for the coal belt in Belgium and France, etc. A multitude of industrial sites sprawled along the railways, weaving through the roaring steam engines, in a steampunk style. Yet the good old days are gone. They exude a smell of desperation and futility, in the dark and damp corners of history and land, lingering as a scar. They are martyrs of a once-gilded era, figures of rampant development. Such symbolism is widely employed in the lenses of Chinese directors: movies such as Black Coal, Thin Ice, Piano in a Factory, and The Looming Storm all have their stories set in the Rust Belt of the Northeastern provinces, and their characters’ fates are correlated with those of the doomed factories. In Black Coal, Thin Ice, the old, dilapidated factory casts a cold and grey tone over the story’s backdrop. It starkly contrasts with the vibrant, yet eerily lurid neons of the dance hall, carving out the contours of a magical realist narrative. I realized that my preference for such movies is rooted in an obsession with industrial scenes, especially collieries, refineries, and steelworks. The colossal machinery, such as the blast furnaces, the converters, and the gantry cranes, are lurking behemoths, seething in silence. Back then, they were fully operating, slowly and clumsily, but full of strength, with such shaking energy and ominous voices that no one dared to tame their temper. Now they are silent and static, but this giant amount of power still oozes out from their hulking bodies, creating an irresistible fear. It is electrifying to see such monolithic mass indulging in its obsolescence, being brazenly unproductive, exposing, or even proudly manifesting its own decay.³

Kitchen of an abandoned mental asylum, Boston

Private Space.

A cozy, personal, familiar place conjures a very different atmosphere. Houses, theaters, and hotel rooms, unlike the metal beasts of industrial ruins, are scenes from our daily life and common practice, and their decay arouses fear in ourselves, because they showcase what our dependable environment would ultimately become without proper maintenance. Many such places retain the temporality of the moment they were abandoned; the CDs on the stand, the paper and photos on the desk, plates in the sink, a half-eaten bag of chips, all seem to acquiesce that the routine is merely suspended, that the owner might return at any moment. The stillness of time is laid bare within this stagnant cocoon. The display of such an abrupt withdrawal holds a fatal fascination for explorers; they stand at exactly the same spatial coordinates in the space-time reference frame, where the only difference is on the time axis. History never truly fades, as the rising entropy in the closed room inscribes every fragment of the past. That aligns with people’s investigative penchant for finding evidence of existence.

Schools, hospitals, or even sanatoria, on the other hand, may trigger horror in their abandoned state. This horror is not so tightly related to their original functions; it is more a direct consequence of their liminal nature and the commonly perceived supernatural presence. As someone with a materialist perspective, I find the former a better explanation. A liminal space is a state of transition, a space in-between, typically with a surrealistic touch of disorientation. Corridors, stairwells, and hallways are transitional places connecting functional spaces; once devoid of people, they appear eerie and forlorn. Their abandoned state adds an additional layer of purposelessness; the corridor no longer leads to a destination with a defined purpose, but to the elusive, dark unknown.

Liminal space, hotel and airport

Open-Air Relics.

From ancient villages to modern-day bunkers, relics are remains of history, in the sense that they either served a special yet now-obsolete function, or lost their importance and gradually withered until buried in time. The abandonment took place on an earlier time scale, and their decay is now in its final stage, where past usage is reconstructed through imagination and historical footnotes. Relics lie on the border of the definition of urban exploration, and are a good introductory course for newcomers. Take a walk through the relics, away from the crowds. At a site like Machu Picchu or Easter Island, the spiritual resonance with those who labored and lived there centuries ago might strike your soul. When the abandonment is more recent, this feeling intensifies, hence the irresistible allure of urban exploration. Nowadays, places like Machu Picchu and Easter Island are never referred to as “abandoned sites” or “ghost towns”, nor were they at any point in history. The notion of abandonment, as understood today, is largely a byproduct of urban productivity within the framework of capitalism, something that did not exist in the relics’ societies. Likewise, modern-day bunkers are not truly “abandoned”, as their purpose is inherently tied to a specific moment in time.

RAF Stenigot, radar dishes in a field. Photo credit: Rick Nunn

This concludes our brief introduction to urban exploration. The 16th Street railway station, after years of advocacy from Oakland Heritage, finally gained recognition from the National Register of Historic Places. Multiple reactivation plans have been proposed, and it will likely be revived soon. But not all abandoned sites receive this recognition. Many of them either remain forgotten, slowly covered in graffiti and crumbling under vandalism, or are demolished once the authorities grow tired of such activities. This is the life cycle of manmade architecture. There are thousands of them dying, and thousands being renovated and reborn every day. The urban complex is a giant organism where growth and decay occur simultaneously. So is everything else. Being able to take a peek at their post-mortem stage is a rare and cherished privilege. It is also our responsibility to document with meticulous care, regardless of what the state might be. This is why we love urbex.

1. Elizabeth Blasius, Urban Exploration as Creative Practice, MAS Context, 2024.
2. Ignasi de Solà-Morales Rubió, Presente y futuros. La arquitectura en las ciudades. In AA. VV., Presente y futuros. Arquitectura en las grandes ciudades, Barcelona: Collegi Oficial d’Arquitectes de Catalunya / Centre de Cultura Contemporània, 1996, 10-23.
3. Ninurta, Walking in the post-apocalyptic world, https://www.youtube.com/@Ninurta_Urbex.

Frame Transform Fun

2024-12-17T23:00:00+00:00

The frame transform is a crucial tool in robotics and many other fields. From camera calibration to object grasping, it is used extensively throughout the entire pipeline. The conventional transform between static frames, however, is already well covered in many sources, so our topic is something more dynamic and interesting. Before we start, as usual, we have to set the notation for our discussion. Define the symbol ${}^BP_A$ as state $P$ in frame $A$ viewed in frame $B$, that is, the transform of $P$ from frame $A$ to $B$.

The topic we want to discuss is relative motion. Static poses and frame transformations are easy; no matter how many transforms there are, just chain them together and match the superscripts with the subscripts. For motion, which involves the time derivatives of distances, simple matrix multiplication is not sufficient. We start with two observers, A and B, looking at some moving object M, simplified as a particle. Given A’s observation of M’s translation ${}^Av_M$ and self-rotation ${}^A\omega_{M'}$, we want to deduce B’s observation. This is a trivial problem: because A and B are both static, it is merely a different point of view:

\[\begin{align}\begin{bmatrix}{}^{B}v_M\\{}^B\omega_{M'}\end{bmatrix} &= \begin{bmatrix}{}^BR_A & 0\\0 & {}^BR_A\end{bmatrix} \begin{bmatrix}{}^Av_M\\{}^A\omega_{M'}\end{bmatrix}\end{align}\]

Here we use the apostrophe in $M'$ to indicate the self-rotation of M, to differentiate it from ${}^A\omega_{M}$, which means the angular velocity of particle M revolving around an axis of frame A.
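As a quick sanity check of the static case, here is a minimal NumPy sketch. The helper names `rot_z` and `change_view` and all the numbers are made up for illustration; the only thing taken from the text is the block-diagonal structure of equation (1).

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for a rotation of `theta` radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def change_view(R_BA, v_A, w_A):
    """Re-express a (linear, angular) velocity pair seen by A along B's axes."""
    Z = np.zeros((3, 3))
    T = np.block([[R_BA, Z],
                  [Z, R_BA]])          # same rotation applied to both halves
    out = T @ np.concatenate([v_A, w_A])
    return out[:3], out[3:]

# Hypothetical setup: A's axes are B's axes rotated by 90 degrees about z.
R_BA = rot_z(np.pi / 2)
v_A = np.array([1.0, 0.0, 0.0])   # M moves along A's x axis
w_A = np.array([0.0, 0.0, 2.0])   # M spins about A's z axis

v_B, w_B = change_view(R_BA, v_A, w_A)
```

Seen from B, the particle moves along B's y axis, while the spin about the shared z axis is unchanged. Nothing but the point of view has changed.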

What if one observer, say A, is also moving, translating or rotating by itself? If that motion is anything other than uniform translation, frame A becomes non-inertial, that is, Newton’s first law no longer holds in it: in an accelerating car, objects appear to start moving without any external force. Since A could be accelerating arbitrarily, we cannot perform a straightforward analysis inside it. We have to find another frame that is inertial, either static or in uniform motion, and in our case that is observer B.

How we pick this frame of reference creates problems. No matter how “relative” we get, when we talk about velocity, we need one observer to be inertial and evaluate the velocity of the other in its frame; otherwise the notion of “velocity” doesn’t make sense. But how do we know whether any frame is actually inertial? For a robot arm, it could be the robot base. If the arm is mounted on a moving base, then somewhere on the ground. If the vehicle is an aircraft, then somewhere on the earth, but we know for a fact that the earth is moving non-uniformly in the universe. Then there’s the sun, the Milky Way, the Local Group, the Virgo Supercluster, the Laniakea Supercluster. It seems like everything is non-inertial, and this caused widespread panic, because people like stability and controllability; such a dynamic and chaotic view of the universe is unacceptable. Where “on earth” can we find the real inertial frame of reference, and regulate the motion once and for all?

About a hundred and fifty years ago, scientists faced the same problem. In an era of explosive development in physics, the discovery of electromagnetic waves and the nature of light raised many practical questions. How does light travel through the vastness of space? Suddenly the choice of “earth” as a reference frame became negligible on the universal scale. They adopted the term Ether, referring to the medium of light transmission, as an absolute static frame of reference for the universe, for peace of mind. That’s where Albert Einstein came into play. He stated the Equivalence Principle, which builds upon Galileo’s Weak Equivalence Principle and is also a cornerstone of his later General Theory of Relativity. The principle itself focuses on the equivalence between a gravitational field and an accelerating frame of reference: imagine an elevator accelerating rapidly upwards in deep space and a human on earth, both seeing the same behavior of an object falling to the ground. We can further extend this principle and interpret it as follows: a raindrop falling towards the earth can view itself as stationary, while the earth is rushing towards it. There is no notion of a global inertial frame, and we do not need one. We only need to find an inertial reference frame for local analysis.

With that in mind, going back to the goal of the original problem, we want to describe the motion of M in the inertial frame, B, through A’s observation of M. Starting from the simplest assumption, where there is only linear velocity of A and M, we can write the velocity of M in inertial frame B as

\[{}^Bv_M = {}^Bv_A + {}^BR_A{}^Av_M\]

For simplicity, we sometimes drop the superscript of the inertial frame, and replace it by the object frame. We will, however, keep the full notation to reduce confusion.

This concludes the linear velocity part. If we consider angular velocity, things get more involved. It was a bit handwaving in the previous example, but now, when it gets serious, to not further confuse ourselves, we have to define the terms. Rotation, or self-rotation, refers to the object spinning around its own axis; revolution refers to the object orbiting around an axis in another frame. Thus an object can have angular velocity wrt its own frame, but it cannot have linear velocity wrt its own frame, because nothing can translate relative to itself. Note that when describing self-rotation, the object’s original frame of reference (neutral pose of zero rotation) should not change as a reference. In our discussion, we use ${}^A\omega_{A'}$ to denote object A’s self-rotation, and ${}^B\omega_A$ for object A’s revolution about some axis in inertial frame B.

Another important point is that, since we are describing the motion in the frames using linear velocities ${}^Bv_A$ and ${}^Av_M$, there is naturally no revolution angular velocity ${}^B\omega_A$ and ${}^A\omega_M$ in the picture. Only self-rotation is allowed, because an object is able to translate and self-rotate at the same time, but not translate and revolve at the same time, since an arbitrary translational motion will conflict with the revolving motion. A pure revolution is a special case, but also falls under linear description framework using $v = \omega \times{}r$, where $\omega$ is equivalently expressed by the linear velocity. We do consider the self-rotation ${}^A\omega_{A'}$ and ${}^M\omega_{M'}$, but assuming there is no internal angular acceleration either.

Let us continue the analysis. If we look at ${}^A\omega_{A'}$ by itself, we can see that it contributes to the linear velocity of M in frame A. We then convert this velocity into inertial frame B:

\[{}^BR_A({}^A\omega_{A'}\times{}^{A}t_M)\]

Note that since frame A’s origin does not change when rotating, ${}^{A'}t_M$ and ${}^{A}t_M$ are equivalent. ${}^BR_A$ represents the rotation of reference A in frame B; as mentioned above, it remains constant as a reference. The only time-varying component is ${}^At_M$. As for the ${}^M\omega_{M'}$ term, the self-rotation of M does not affect its linear velocity in B. We can also write a self-rotation expressed in its own frame with a single superscript, ${}^A\omega$ and ${}^M\omega$. We will keep using the apostrophe when indicating self-rotation viewed in another frame.

Since each velocity component acts independently at any given time, the problem is linear, and we can simply add up all the components for the final linear velocity (no pun intended):

\[\begin{equation}{}^Bv_M = {}^Bv_A + {}^BR_A{}^Av_M + {}^BR_A({}^A\omega\times{}^{A}t_M)\end{equation}\]
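One way to convince ourselves of equation (2) is to differentiate the position of M in B numerically and compare. The sketch below assumes that ${}^BR_A$ is the orientation of A at the instant of interest and that ${}^A\omega$ is expressed along A's own axes. The helper names and all numeric values are invented for the check.

```python
import numpy as np

def skew(a):
    """Cross-product matrix: skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def rodrigues(rotvec):
    """Rotation matrix for a rotation vector (axis * angle), via Rodrigues' formula."""
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)
    K = skew(rotvec / angle)
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# Hypothetical instantaneous state (numbers are arbitrary).
R_BA = rodrigues(np.array([0.3, -0.2, 0.5]))  # orientation of A in B
p_BA = np.array([1.0, 2.0, 0.5])              # origin of A in B
v_BA = np.array([0.4, -0.1, 0.2])             # linear velocity of A in B
w_A = np.array([0.1, 0.7, -0.3])              # self-rotation of A, in A's axes
t_AM = np.array([0.5, 0.0, 1.0])              # position of M in A
v_AM = np.array([-0.2, 0.3, 0.1])             # linear velocity of M seen by A

# Equation (2): frame translation + observed velocity + lever-arm term.
v_BM = v_BA + R_BA @ v_AM + R_BA @ np.cross(w_A, t_AM)

def position_of_M_in_B(dt):
    """Position of M in B a short time dt later, with all rates held constant."""
    R = R_BA @ rodrigues(w_A * dt)            # A spins about its own axes
    return (p_BA + v_BA * dt) + R @ (t_AM + v_AM * dt)

dt = 1e-6
v_numeric = (position_of_M_in_B(dt) - position_of_M_in_B(-dt)) / (2 * dt)
error = np.linalg.norm(v_numeric - v_BM)      # should be tiny
```

The central difference agrees with the closed form up to numerical noise, which supports the claim that the three contributions can simply be added.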

The next part is angular velocity ${}^B\omega_{M'}$. The biggest contributor is obviously M’s self-rotation, observed by A:

\[{}^BR_A{}^A\omega_{M'}\]

Similar to the linear velocity equation, which contains A’s linear velocity, the final angular velocity also contains A’s angular velocity in frame B:

\[\begin{equation}{}^B\omega_{M'} = {}^BR_A{}^A\omega + {}^BR_A{}^A\omega_{M'}\end{equation}\]

We can rewrite the expressions in matrix form:

\[\begin{equation}\begin{bmatrix}{}^Bv_M\\{}^B\omega_{M'}\end{bmatrix}=\begin{bmatrix}\mathbf{I} & {}^BR_A[{}^Mt_A]_\times\\\mathbf{0} & {}^BR_A\end{bmatrix}\begin{bmatrix}{}^Bv_A\\{}^A\omega\end{bmatrix}+\begin{bmatrix}{}^BR_A & \mathbf{0}\\\mathbf{0} & {}^BR_A\end{bmatrix}\begin{bmatrix}{}^Av_M\\{}^A\omega_{M'}\end{bmatrix}\end{equation}\]
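The matrix form can be assembled directly with `np.block`. This is only a sketch: `stacked_velocity` is a hypothetical helper, the numbers are arbitrary, and it assumes ${}^Mt_A = -{}^At_M$, that is, the vector from M back to A's origin written along A's axes.

```python
import numpy as np

def skew(a):
    """Cross-product matrix: skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def stacked_velocity(R_BA, t_AM, v_BA, w_A, v_AM, w_AM):
    """Evaluate the 6x6 matrix form and return (linear, angular) velocity of M in B."""
    Z = np.zeros((3, 3))
    t_MA = -t_AM  # assumption: vector from M to A's origin, along A's axes
    frame_motion = np.block([[np.eye(3), R_BA @ skew(t_MA)],
                             [Z, R_BA]])
    observed = np.block([[R_BA, Z],
                         [Z, R_BA]])
    out = (frame_motion @ np.concatenate([v_BA, w_A])
           + observed @ np.concatenate([v_AM, w_AM]))
    return out[:3], out[3:]

# Arbitrary illustrative values.
R_BA = np.array([[0.0, -1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])
t_AM = np.array([1.0, 0.0, 0.0])
v_BA = np.array([0.5, 0.0, 0.0])
w_A = np.array([0.0, 0.0, 1.0])
v_AM = np.array([0.0, 0.2, 0.0])
w_AM = np.array([0.0, 0.0, 0.5])

v_BM, w_BM = stacked_velocity(R_BA, t_AM, v_BA, w_A, v_AM, w_AM)

# Term-by-term versions of equations (2) and (3) for comparison.
v_direct = v_BA + R_BA @ v_AM + R_BA @ np.cross(w_A, t_AM)
w_direct = R_BA @ w_A + R_BA @ w_AM
```

Both routes give the same six numbers. The stacked form is just bookkeeping that lets us treat the pair of velocities as one vector.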

Pay special attention to the cross product part in equation (2). We arrive at the final expression through the following conversion:

\[\begin{align}{}^BR_A({}^A\omega\times{}^At_M) &= ({}^BR_A{}^A\omega)\times({}^BR_A{}^At_M)\\ &=-[{}^BR_A{}^At_M]_\times ({}^BR_A{}^A\omega)\\&=-{}^BR_A[{}^At_M]_\times{}^AR_B({}^BR_A{}^A\omega)\\&=-{}^BR_A[{}^At_M]_\times{}^A\omega={}^BR_A[{}^Mt_A]_\times{}^A\omega\end{align}\]

where step (5) is based on $\mathbf{R}(a\times b) = (\mathbf{R}a)\times(\mathbf{R}b)$, step (6) swaps the operands using $a\times b=-[b]_\times a$, and step (7) is from $[\mathbf{R}v]_\times=\mathbf{R}[v]_\times \mathbf{R}^T$. In the last step we take ${}^Mt_A=-{}^At_M$, the vector from M back to A’s origin expressed along A’s axes, which gives the block used in equation (4).
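The two identities are easy to spot-check numerically. The sketch below draws a random proper rotation and random vectors (the seed and the QR trick are arbitrary choices for the check). The sign is worth tracking, since $a\times b = -[b]_\times a$.

```python
import numpy as np

def skew(a):
    """Cross-product matrix: skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

rng = np.random.default_rng(0)

# Random orthogonal matrix from a QR decomposition, flipped if needed so det = +1.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

w = rng.normal(size=3)   # stands in for the angular velocity
t = rng.normal(size=3)   # stands in for the lever arm

lhs = R @ np.cross(w, t)

# R (a x b) = (R a) x (R b) holds for proper rotations.
ok_distribute = np.allclose(lhs, np.cross(R @ w, R @ t))

# [R v]_x = R [v]_x R^T
ok_conjugate = np.allclose(skew(R @ t), R @ skew(t) @ R.T)

# Pulling the lever arm into a skew matrix flips the sign: w x t = -[t]_x w.
ok_sign = np.allclose(lhs, -R @ skew(t) @ w)
```

All three flags come out `True`. Note that the first identity relies on $\det\mathbf{R}=+1$; for a reflection the cross product picks up an extra minus sign.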

There is an analogy to the relative motion transformation, but not exactly the same. The velocity of a point can be described by 3d linear velocity and 3d self-rotation angular velocity; another important physical quantity, wrench, also describes the state of a point using 3d force and 3d torque, imposed on the point. Same as the velocity example, if the point P is not rigidly attached to its observer O, then the wrench observed in frame O is merely a viewpoint transform,

\[\begin{align}\begin{bmatrix}{}^{O}F\\{}^O\tau\end{bmatrix} &= \begin{bmatrix}{}^OR_P & 0\\0 & {}^OR_P\end{bmatrix} \begin{bmatrix}{}^PF\\{}^P\tau\end{bmatrix}\end{align}\]

Similar to the leap we took in the velocity example, where we made observer A moving, in this case, if P and O are rigidly attached, when a wrench is exerted on point P, what is the corresponding wrench on O? The deduction process is comparable to that of velocity, and readers can try on their own. Here we give the solution:

\[\begin{align}\begin{bmatrix}{}^{O}F\\{}^O\tau\end{bmatrix} &= \begin{bmatrix}{}^OR_P & 0\\ \left[{}^Ot_P\right]_\times{}^OR_P & {}^OR_P\end{bmatrix} \begin{bmatrix}{}^PF\\{}^P\tau\end{bmatrix}\end{align}\]
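To make the rigid-attachment case concrete, here is a small sketch of the wrench transform above. `shift_wrench` is a hypothetical helper name, and the lever arm and force are made-up numbers chosen so the result can be checked by hand.

```python
import numpy as np

def skew(a):
    """Cross-product matrix: skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def shift_wrench(R_OP, t_OP, F_P, tau_P):
    """Wrench applied at P (in P's axes) -> equivalent wrench at O (in O's axes)."""
    Z = np.zeros((3, 3))
    T = np.block([[R_OP, Z],
                  [skew(t_OP) @ R_OP, R_OP]])   # lower-left block adds t x F
    out = T @ np.concatenate([F_P, tau_P])
    return out[:3], out[3:]

# Hypothetical example: P sits one unit along O's x axis with the same orientation.
R_OP = np.eye(3)
t_OP = np.array([1.0, 0.0, 0.0])
F_P = np.array([0.0, 10.0, 0.0])   # push along +y at P
tau_P = np.zeros(3)                # no torque applied at P itself

F_O, tau_O = shift_wrench(R_OP, t_OP, F_P, tau_P)
```

The force arrives at O unchanged, while the lever arm turns it into an additional torque about O's z axis. Compared with the velocity case, the coupling block sits in the lower-left corner instead of the upper-right.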

The similarities and differences between these two cases hint at the physical nature behind velocity and wrench. We may dig deeper in this direction, and combine it with an analysis of acceleration terms, in future articles.