<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Shaoyang Cui</title><link>https://spidermonk7.github.io/</link><atom:link href="https://spidermonk7.github.io/index.xml" rel="self" type="application/rss+xml"/><description>Shaoyang Cui</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 24 Oct 2022 00:00:00 +0000</lastBuildDate><image><url>https://spidermonk7.github.io/media/icon_hu7729264130191091259.png</url><title>Shaoyang Cui</title><link>https://spidermonk7.github.io/</link></image><item><title>VidNum1.4K - A Comprehensive Benchmark for Video-based Numerical Reasoning</title><link>https://spidermonk7.github.io/ongoing-projects/vidnum-1-4k/</link><pubDate>Fri, 03 Apr 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/ongoing-projects/vidnum-1-4k/</guid><description>&lt;p>This research introduces VNum, a comprehensive VideoQA benchmark containing 1,379 human-annotated video-question pairs designed to test multi-step numerical reasoning in Vision-Language Models (VLMs). Moving beyond simple counting, VNum spans diverse real-world environments to quantify objects, actions, and events through a unique three-level hierarchy.&lt;/p>
&lt;!-- Official page: &lt;https://vidnumteam.github.io> --></description></item><item><title>ClawTrap - MITM-Based Red-Teaming for OpenClaw Security Evaluation</title><link>https://spidermonk7.github.io/ongoing-projects/clawtrap/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/ongoing-projects/clawtrap/</guid><description>&lt;p>This research introduces ClawTrap, a MITM-based red-teaming framework designed for the real-world security evaluation of autonomous web agents like OpenClaw. To bridge the gap between static sandbox testing and live network threats, ClawTrap provides a reproducible pipeline for rule-driven interception, transformation, and auditing at the network layer.&lt;/p></description></item><item><title>My Failures</title><link>https://spidermonk7.github.io/my-blog-collections/my-failures/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/my-blog-collections/my-failures/</guid><description/></item><item><title>日拱一卒,功不唐捐</title><link>https://spidermonk7.github.io/my-blog-collections/persistence-never-goes-unrewarded/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/my-blog-collections/persistence-never-goes-unrewarded/</guid><description/></item><item><title>Learning Notes - the Flash-Attention</title><link>https://spidermonk7.github.io/post/understanding_fa/</link><pubDate>Thu, 19 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/post/understanding_fa/</guid><description>&lt;h1 id="motivation">Motivation&lt;/h1>
&lt;p>Everything started with a phone call. My girlfriend asked me:&lt;/p>
&lt;p>&amp;ldquo;Honey, you&amp;rsquo;ve mentioned that you successfully installed &lt;em>FlashAttention&lt;/em> on your new cluster. What is that?&amp;rdquo;&lt;/p>
&lt;p>&amp;ldquo;Oh, it&amp;rsquo;s a Python library that helps with LLM inference, and maybe training too. Without &lt;em>FlashAttention&lt;/em>, LLMs can easily run into CUDA OOM during inference.&amp;rdquo;&lt;/p>
&lt;p>I answered without really thinking.&lt;/p>
&lt;p>&amp;ldquo;Cool. How does it work?&amp;rdquo;&lt;/p>
&lt;p>&amp;ldquo;&amp;hellip;&amp;rdquo;&lt;/p>
&lt;p>And then I realized: I actually had no idea how &lt;em>FlashAttention&lt;/em> works, or why it reduces the memory complexity of LLM inference.&lt;/p>
&lt;p>That was the moment I decided to really look into &lt;em>FlashAttention&lt;/em>, build my own blog page, and start recording the important technical ideas I learn along the way.&lt;/p>
&lt;p>And also, always remember to be someone who keeps asking why.&lt;/p>
&lt;h1 id="background">Background&lt;/h1>
&lt;p>After some searching, I found that &lt;em>FlashAttention&lt;/em> reduces the memory complexity of LLM inference mainly by optimizing how the softmax inside attention is computed.&lt;/p>
&lt;p>In LLMs, softmax appears in more than one place, but the one that matters most for understanding &lt;em>FlashAttention&lt;/em> is the softmax inside the attention mechanism.&lt;/p>
&lt;p>At a high level, self-attention works like this: for each token, the model computes how much attention it should pay to every previous token. Those raw attention scores are first computed, and then softmax turns them into a proper probability distribution.&lt;/p>
&lt;p>More concretely, given query, key, and value matrices $Q$, $K$, and $V$, the attention scores are:&lt;/p>
$$
S = \frac{QK^\top}{\sqrt{d_k}}
$$&lt;p>where a causal mask, when used, additionally sets $S_{ij} = -\infty$ for positions that should be masked.&lt;/p>
&lt;p>Then softmax is applied &lt;strong>row by row&lt;/strong> over the key dimension:&lt;/p>
$$
P_{ij} = \frac{\exp(S_{ij})}{\sum_{t}\exp(S_{it})}
$$&lt;p>This step converts the raw scores into normalized attention weights. Each row now sums to 1, which means the model can interpret them as &amp;ldquo;how much this token attends to each previous token.&amp;rdquo;&lt;/p>
&lt;p>Finally, these attention weights are used to combine the value vectors:&lt;/p>
$$
O = PV
$$&lt;p>So the full attention pipeline is:&lt;/p>
$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$&lt;h2 id="what-softmax-is-doing-here">What softmax is doing here&lt;/h2>
&lt;p>The role of softmax is simple but important:&lt;/p>
&lt;ul>
&lt;li>It turns arbitrary similarity scores into non-negative &lt;strong>normalized&lt;/strong> weights.&lt;/li>
&lt;li>It amplifies larger scores and suppresses smaller ones.&lt;/li>
&lt;li>It makes the model focus on the most relevant context tokens when forming the output.&lt;/li>
&lt;/ul>
&lt;p>Without softmax, the attention scores would just be raw numbers. The model would have no clean way to interpret them as relative importance.&lt;/p>
&lt;h2 id="why-this-matters-for-flashattention">Why this matters for &lt;em>FlashAttention&lt;/em>&lt;/h2>
&lt;p>This is exactly the expensive part in standard attention.&lt;/p>
&lt;p>To compute attention in the usual way, we often materialize the full score matrix $QK^\top$ and then the full softmax result. If the sequence length is $L$, this matrix has size $L \times L$, which leads to very high memory cost.&lt;/p>
&lt;p>&lt;em>FlashAttention&lt;/em> does &lt;strong>not&lt;/strong> change the mathematical definition of softmax. Instead, it computes the same result more carefully, in a tiled / blockwise way, so that it avoids storing the full attention matrix in GPU memory.&lt;/p>
&lt;p>So when people say &lt;em>FlashAttention&lt;/em> reduces memory usage, they do &lt;strong>not&lt;/strong> mean it removes softmax. They mean it computes the same softmax-based attention more efficiently.&lt;/p>
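&lt;p>To make this concrete, here is a minimal NumPy sketch of the standard pipeline that materializes the full $S$ and $P$ matrices. The function name and shapes are illustrative only, not from any particular library:&lt;/p>

```python
# Minimal sketch of standard attention: build the full L x L score matrix S,
# take a row-wise softmax to get P, then combine the values. Illustrative only.
import numpy as np

def naive_attention(Q, K, V):
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                     # full L x L score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # subtract row max for stability
    P = P / P.sum(axis=-1, keepdims=True)          # each row now sums to 1
    return P @ V                                   # O = P V

rng = np.random.default_rng(0)
L, d = 8, 4
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
O = naive_attention(Q, K, V)                       # shape (8, 4)
```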
&lt;p>At first glance, a single attention head does not seem too expensive.&lt;/p>
&lt;p>For example, if $L=512$ and each number takes 2 bytes, then one $L \times L$ score matrix costs:&lt;/p>
$$
512 \times 512 \times 2 = 524288 \text{ bytes} \approx 0.5 \text{ MB}
$$&lt;p>However, in a real transformer we usually have multiple heads, multiple layers, and additional tensors such as attention probabilities, Q/K/V activations, and KV cache.&lt;/p>
&lt;p>If we include the head dimension, the score matrix memory for one layer is roughly:&lt;/p>
$$
B \times H \times L^2 \times \text{bytes per element}
$$&lt;p>For example, with $B=1$, $H=32$, $L=4096$, and 2 bytes per element, the attention score matrix alone is already about 1 GB for a single layer.&lt;/p>
&lt;p>And this is only one intermediate tensor. That is why attention becomes a serious memory bottleneck for long sequences.&lt;/p>
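&lt;p>The arithmetic is easy to reproduce with a throwaway helper (illustrative, not from any library):&lt;/p>

```python
# Back-of-envelope size of one attention score matrix: B * H * L^2 * bytes.
def score_matrix_bytes(B, H, L, bytes_per_elem=2):
    return B * H * L * L * bytes_per_elem

small = score_matrix_bytes(B=1, H=1, L=512)    # 524288 bytes, about 0.5 MB
big = score_matrix_bytes(B=1, H=32, L=4096)    # 1073741824 bytes, about 1 GB
```

&lt;p>Because the cost scales with $L^2$, doubling the sequence length quadruples this number.&lt;/p>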
&lt;p>&lt;strong>Why do we have to store such a large tensor?&lt;/strong>&lt;/p>
$$
P_{ij} = \frac{\exp(S_{ij})}{\sum_{t}\exp(S_{it})}
$$&lt;p>In a naive implementation, we usually store both $S$ and the normalized matrix $P$.
&lt;strong>Why? Because softmax is not just a pointwise operation. For each row, it needs to look at all the elements in that row, compute their relative scale, and normalize them.&lt;/strong>
So the implementation often materializes the whole matrix first, and only then moves on to the next step.&lt;/p>
&lt;p>After that, we compute:&lt;/p>
$$
O = PV
$$&lt;p>So in the standard pipeline, the model often stores large intermediate tensors such as:&lt;/p>
&lt;ul>
&lt;li>the score matrix $S$&lt;/li>
&lt;li>the softmax result $P$&lt;/li>
&lt;li>and sometimes extra temporary values used for numerical stability&lt;/li>
&lt;/ul>
&lt;p>This is why attention becomes so memory-hungry.
The real issue is not just the formula itself, but the way the computation is scheduled and stored in memory.&lt;/p>
&lt;p>And this is exactly where &lt;em>FlashAttention&lt;/em> comes in.&lt;/p>
&lt;p>The key idea of &lt;em>FlashAttention&lt;/em> is: maybe we do not need to store the full $L \times L$ matrix at all.
Instead of materializing the whole attention matrix in GPU memory, &lt;em>FlashAttention&lt;/em> computes attention block by block, while still producing the exact same final result.&lt;/p>
&lt;p>In other words, &lt;em>FlashAttention&lt;/em> does not change the mathematics of attention.
It changes the order of computation, so that the GPU does much less memory movement and avoids storing those massive intermediate tensors.&lt;/p>
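&lt;p>A toy NumPy check of that claim for a single query row: process the scores block by block, keep only a running max, a running normalizer, and a running weighted sum, and the full-softmax answer comes out exactly. This is a sketch of the &amp;ldquo;online softmax&amp;rdquo; idea, not the real fused CUDA kernel:&lt;/p>

```python
# Toy check: blockwise softmax-attention for one query row matches the answer
# from materializing the full row of scores. Illustrative sketch only.
import numpy as np

def row_attention_blockwise(q, K, V, block=4):
    d_k = q.shape[-1]
    m = -np.inf                            # running max of scores seen so far
    ell = 0.0                              # running softmax denominator
    u = np.zeros_like(V[0], dtype=float)   # running weighted numerator
    for start in range(0, K.shape[0], block):
        s = q @ K[start:start + block].T / np.sqrt(d_k)  # this block's scores
        m_new = max(m, s.max())
        scale_old = np.exp(m - m_new)      # re-reference the old statistics
        p = np.exp(s - m_new)
        ell = scale_old * ell + p.sum()
        u = scale_old * u + p @ V[start:start + block]
        m = m_new
    return u / ell

rng = np.random.default_rng(1)
K, V = rng.normal(size=(12, 4)), rng.normal(size=(12, 4))
q = rng.normal(size=4)
s = q @ K.T / np.sqrt(4)
full = np.exp(s - s.max())
full = full / full.sum()
o = row_attention_blockwise(q, K, V)
assert np.allclose(o, full @ V)            # same result as the full softmax
```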
&lt;!--
# Flash Attention
The natural question is: if each softmax entry depends on the whole row, how can we avoid storing the full $L \times L$ matrix?
The answer is that we do **not** need to store the whole row forever. We only need enough information to keep softmax numerically correct while we process the row block by block.
## Step 1: Rewrite attention for one row
Let us focus on a single query row $i$. Denote its query vector by $q_i$. For every key $k_j$, we compute a score:
$$
s_{ij} = \frac{q_i k_j^\top}{\sqrt{d_k}}
$$
Then the attention probability is:
$$
p_{ij} = \frac{\exp(s_{ij})}{\sum_{t=1}^{L}\exp(s_{it})}
$$
And the final output vector for this row is:
$$
o_i = \sum_{j=1}^{L} p_{ij} v_j
$$
Substituting the definition of $p_{ij}$ gives:
$$
o_i = \frac{\sum_{j=1}^{L}\exp(s_{ij})v_j}{\sum_{j=1}^{L}\exp(s_{ij})}
$$
This form is already very useful. It tells us that to compute $o_i$, we do not really need the whole softmax matrix itself. What we need is:
- the denominator $\sum_j \exp(s_{ij})$
- the weighted numerator $\sum_j \exp(s_{ij}) v_j$
For numerical stability, we should not use $\exp(s_{ij})$ directly. We first subtract the row maximum:
$$
m_i = \max_j s_{ij}
$$
Then we rewrite the denominator as:
$$
\ell_i = \sum_{j=1}^{L}\exp(s_{ij} - m_i)
$$
and the numerator as:
$$
u_i = \sum_{j=1}^{L}\exp(s_{ij} - m_i)v_j
$$
So the row output becomes:
$$
o_i = \frac{u_i}{\ell_i}
$$
This is the key observation: for one row, we only need to keep track of three things:
- the running max $m_i$
- the running normalizer $\ell_i$
- the running weighted sum $u_i$
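A quick NumPy check of this rewrite for one row (the variable names mirror the symbols above; purely illustrative):

```python
# Check that o_i = u_i / ell_i, with the row max subtracted, equals the
# textbook softmax output for one row. Names mirror the derivation.
import numpy as np

rng = np.random.default_rng(2)
L, d = 10, 3
s = rng.normal(size=L) * 5           # one row of scores s_{ij}
V = rng.normal(size=(L, d))

m = s.max()                          # m_i = max_j s_{ij}
ell = np.exp(s - m).sum()            # ell_i, the stabilized denominator
u = np.exp(s - m) @ V                # u_i, the stabilized weighted numerator
o = u / ell

p = np.exp(s) / np.exp(s).sum()      # textbook softmax (unstable form)
assert np.allclose(o, p @ V)
```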
## Step 2: Split keys and values into blocks
Now suppose we split the keys and values into blocks:
$$
K = [K^{(1)}, K^{(2)}, \dots, K^{(T)}], \qquad
V = [V^{(1)}, V^{(2)}, \dots, V^{(T)}]
$$
For the same query row $q_i$, block $t$ produces a local score vector:
$$
s_i^{(t)} = \frac{q_i (K^{(t)})^\top}{\sqrt{d_k}}
$$
Inside this block, define:
$$
m_i^{(t)} = \max s_i^{(t)}
$$
$$
\ell_i^{(t)} = \sum_{j \in \text{block } t}\exp(s_{ij} - m_i^{(t)})
$$
$$
u_i^{(t)} = \sum_{j \in \text{block } t}\exp(s_{ij} - m_i^{(t)})v_j
$$
If we only looked at block $t$ in isolation, then its local output would be:
$$
o_i^{(t)} = \frac{u_i^{(t)}}{\ell_i^{(t)}}
$$
The problem is that local softmax values are not the final global softmax values, because the true denominator should include **all** blocks. This is exactly why we need an online update rule.
## Step 3: Online softmax
Suppose we have already processed blocks $1, \dots, t-1$, and we keep three running statistics:
$$
m_i^{\text{old}}, \qquad \ell_i^{\text{old}}, \qquad u_i^{\text{old}}
$$
Now we read the next block $t$, which gives:
$$
m_i^{(t)}, \qquad \ell_i^{(t)}, \qquad u_i^{(t)}
$$
The new global max after merging old information with the new block is:
$$
m_i^{\text{new}} = \max\left(m_i^{\text{old}}, m_i^{(t)}\right)
$$
Now look at the new denominator over all scores seen so far:
$$
\ell_i^{\text{new}} = \sum_{\text{old}} \exp(s_{ij} - m_i^{\text{new}})
+ \sum_{\text{block } t} \exp(s_{ij} - m_i^{\text{new}})
$$
For the old blocks, we factor out the change of reference point from $m_i^{\text{old}}$ to $m_i^{\text{new}}$:
$$
\sum_{\text{old}} \exp(s_{ij} - m_i^{\text{new}})
= \exp(m_i^{\text{old}} - m_i^{\text{new}})
\sum_{\text{old}} \exp(s_{ij} - m_i^{\text{old}})
= \exp(m_i^{\text{old}} - m_i^{\text{new}})\ell_i^{\text{old}}
$$
For the new block:
$$
\sum_{\text{block } t} \exp(s_{ij} - m_i^{\text{new}})
= \exp(m_i^{(t)} - m_i^{\text{new}})
\sum_{\text{block } t} \exp(s_{ij} - m_i^{(t)})
= \exp(m_i^{(t)} - m_i^{\text{new}})\ell_i^{(t)}
$$
Putting them together:
$$
\ell_i^{\text{new}}
= \exp(m_i^{\text{old}} - m_i^{\text{new}})\ell_i^{\text{old}}
+ \exp(m_i^{(t)} - m_i^{\text{new}})\ell_i^{(t)}
$$
The numerator is updated in exactly the same way:
$$
u_i^{\text{new}}
= \exp(m_i^{\text{old}} - m_i^{\text{new}})u_i^{\text{old}}
+ \exp(m_i^{(t)} - m_i^{\text{new}})u_i^{(t)}
$$
and therefore:
$$
o_i^{\text{new}} = \frac{u_i^{\text{new}}}{\ell_i^{\text{new}}}
$$
This is the central derivation behind *FlashAttention*. We never need the full row of $P$, and we never need the full matrix $S$ in GPU memory. We only need the current block and the running statistics $(m_i, \ell_i, u_i)$ for each row.
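The update rule transcribes into NumPy almost line by line. This is a sketch to check the algebra, not a kernel implementation:

```python
# Merge running statistics (m, ell, u) from already-processed blocks with a
# new block's statistics, exactly as in the equations above. Sketch only.
import numpy as np

def block_stats(s, V_blk):
    m = s.max()
    w = np.exp(s - m)
    return m, w.sum(), w @ V_blk

def merge(stat_old, stat_new):
    m1, l1, u1 = stat_old
    m2, l2, u2 = stat_new
    m = max(m1, m2)                              # new global max
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)      # re-reference both sides
    return m, a1 * l1 + a2 * l2, a1 * u1 + a2 * u2

rng = np.random.default_rng(3)
s, V = rng.normal(size=8), rng.normal(size=(8, 2))
# Process in two blocks of 4, then merge:
m, ell, u = merge(block_stats(s[:4], V[:4]), block_stats(s[4:], V[4:]))
p = np.exp(s - s.max())
p = p / p.sum()
assert np.allclose(u / ell, p @ V)               # matches the global softmax
```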
## Step 4: What FlashAttention is actually optimizing
The important thing is that *FlashAttention* does **not** approximate softmax. The final answer is mathematically the same as standard attention.
What changes is the execution order:
1. load a small block of $Q$, $K$, and $V$
2. compute a block of scores
3. update the running softmax statistics
4. immediately accumulate the output
5. move on to the next block
The large $L \times L$ score matrix never has to be written out to high-bandwidth memory, and the large $L \times L$ probability matrix never has to be materialized there either.
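Under the derivation above, the whole forward schedule fits in a few lines of NumPy. The real implementation fuses these steps into one CUDA kernel and keeps each block in on-chip SRAM; this sketch only mirrors the order of operations:

```python
# Sketch of the execution order: loop over K/V blocks, keep per-row running
# statistics, never build the L x L matrices. Not the real fused kernel.
import numpy as np

def flash_attention_sketch(Q, K, V, block=4):
    Lq, d = Q.shape
    m = np.full(Lq, -np.inf)           # running row maxima
    ell = np.zeros(Lq)                 # running row normalizers
    O = np.zeros((Lq, V.shape[1]))     # running unnormalized outputs
    for j in range(0, K.shape[0], block):
        S_blk = Q @ K[j:j + block].T / np.sqrt(d)  # scores vs this block only
        m_new = np.maximum(m, S_blk.max(axis=1))
        scale = np.exp(m - m_new)                  # re-reference old statistics
        P_blk = np.exp(S_blk - m_new[:, None])
        ell = scale * ell + P_blk.sum(axis=1)
        O = scale[:, None] * O + P_blk @ V[j:j + block]
        m = m_new
    return O / ell[:, None]

rng = np.random.default_rng(4)
Q = rng.normal(size=(6, 3))
K, V = rng.normal(size=(10, 3)), rng.normal(size=(10, 3))
O = flash_attention_sketch(Q, K, V, block=3)
```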
## Memory cost of standard attention
Now let us compare the memory cost more concretely.
For standard attention, a forward pass usually materializes:
- $Q$, $K$, and $V$, each of shape $B \times H \times L \times d$
- the score matrix $S$, of shape $B \times H \times L \times L$
- the probability matrix $P$, also of shape $B \times H \times L \times L$
- the output $O$, of shape $B \times H \times L \times d$
If each element takes $b$ bytes, then the total memory just for these tensors is roughly:
$$
M_{\text{standard}} \approx b\left(4BHLd + 2BHL^2\right)
$$
The important part is the $2BHL^2$ term. Once $L$ becomes large, it dominates everything else.
For example, let:
- $B = 1$
- $H = 32$
- $L = 4096$
- $d = 128$
- $b = 2$ bytes
Then:
$$
\text{size}(Q) = \text{size}(K) = \text{size}(V) = \text{size}(O)
= 1 \times 32 \times 4096 \times 128 \times 2
\approx 32 \text{ MB}
$$
So $Q$, $K$, $V$, and $O$ together are about:
$$
4 \times 32 \text{ MB} = 128 \text{ MB}
$$
Now look at one of the $L \times L$ tensors:
$$
\text{size}(S)
= 1 \times 32 \times 4096^2 \times 2
\approx 1 \text{ GB}
$$
and $P$ costs about another 1 GB.
So standard attention is already at roughly:
$$
128 \text{ MB} + 1 \text{ GB} + 1 \text{ GB}
\approx 2.1 \text{ GB}
$$
for just one layer's major forward tensors.
If $L$ doubles from $4096$ to $8192$, the $L^2$ part becomes four times larger, so the $S$ and $P$ tensors jump from about 1 GB each to about 4 GB each. That is why long-context attention becomes painful so quickly.
## Memory cost of FlashAttention
With *FlashAttention*, we still need to store $Q$, $K$, $V$, and the final output $O$ in GPU memory. So the $BHLd$ part does not disappear.
What disappears from high-bandwidth memory is the need to materialize the full $S$ and $P$ tensors.
Instead, *FlashAttention* keeps:
- blocks of $Q$, $K$, and $V$ in fast on-chip SRAM / shared memory
- a running max $m$ for each row
- a running normalizer $\ell$ for each row
- the running output accumulator
From the perspective of GPU global memory, the main footprint becomes:
$$
M_{\text{flash}} \approx b\left(4BHLd\right) + \text{small row-wise statistics}
$$
The row-wise statistics are only $O(BHL)$, which is tiny compared with $O(BHL^2)$.
Using the same example as above:
- $Q$, $K$, $V$, and $O$ are still about 128 MB in total
- the extra running statistics are tiny compared with gigabytes
- the giant $S$ and $P$ tensors are no longer stored in HBM
So the memory bottleneck changes from:
$$
O(BHL^2)
$$
to:
$$
O(BHLd)
$$
with respect to the main attention activations.
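A small estimator (counting only the tensors listed above, with illustrative constants) makes the comparison concrete:

```python
# Activation-memory estimate in bytes: the standard path keeps S and P (the
# 2*B*H*L^2 term); the blockwise path keeps only O(B*H*L) row statistics.
def attn_memory_bytes(B, H, L, d, b=2, blockwise=False):
    qkvo = 4 * B * H * L * d * b            # Q, K, V, O
    if blockwise:
        return qkvo + 2 * B * H * L * b     # running m and ell per row
    return qkvo + 2 * B * H * L * L * b     # plus full S and P

std = attn_memory_bytes(1, 32, 4096, 128)                    # about 2.1 GB
flash = attn_memory_bytes(1, 32, 4096, 128, blockwise=True)  # about 135 MB
```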
## Why this matters so much in practice
This also explains why the gain is most dramatic when the sequence length is large.
If $L$ is small, then $L^2$ is still manageable, and *FlashAttention* mainly helps by reducing memory traffic.
But when $L$ becomes large, standard attention spends huge memory on $S$ and $P$, while *FlashAttention* keeps memory growth much closer to the size of $Q$, $K$, $V$, and $O$.
There is one subtle point here: in autoregressive decoding with only **one new token**, the attention score shape is more like $1 \times L$ rather than $L \times L$, so the softmax matrix itself is not the main bottleneck. In that setting, the KV cache is often the bigger memory cost.
The biggest win of *FlashAttention* shows up in:
- training
- long-context prefilling
- full-sequence attention where many query positions are processed together
## Final takeaway
The standard implementation stores the full attention score matrix because it computes softmax in a materialize-first way: first build $S$, then build $P$, and only then multiply by $V$.
*FlashAttention* notices that this is not mathematically necessary. If we maintain the right running statistics for each row, we can compute exactly the same softmax and exactly the same output while processing the sequence block by block.
So the real optimization is not "changing softmax into something cheaper." The real optimization is changing **how softmax is executed**, so that the GPU avoids writing and reading enormous $L \times L$ intermediates. --></description></item><item><title>Awards</title><link>https://spidermonk7.github.io/awards/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/awards/</guid><description/></item><item><title>Learning Notes - Criticality Stage</title><link>https://spidermonk7.github.io/post/learning-notes-criticality/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/post/learning-notes-criticality/</guid><description>&lt;h1 id="new-concepts-to-me">New Concepts to Me&lt;/h1>
&lt;h2 id="leaky---integration---firing-neuron">Leaky - Integration - Firing Neuron&lt;/h2>
&lt;p>A widely-used neuron model that capture more specific computational character of real neurons.&lt;/p>
&lt;h2 id="adex-adaptation-exponential-lif">AdEx: Adaptation Exponential LIF&lt;/h2>
&lt;h3 id="compared-to-hh-model">Compared to HH-Model&lt;/h3></description></item><item><title>Learning Notes - Grid Cell Coding Theory</title><link>https://spidermonk7.github.io/post/learning-notes-cogmap/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/post/learning-notes-cogmap/</guid><description>&lt;h1 id="date-2026-3-27">Date: 2026-3-27&lt;/h1>
&lt;p>An overview of grid cells.&lt;/p>
&lt;h3 id="discovery-and-observations">Discovery and Observations&lt;/h3></description></item><item><title>My Blogs</title><link>https://spidermonk7.github.io/my-blogs/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/my-blogs/</guid><description/></item><item><title>Papers</title><link>https://spidermonk7.github.io/papers/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/papers/</guid><description/></item><item><title>Talks</title><link>https://spidermonk7.github.io/talks/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/talks/</guid><description/></item><item><title>Reading Notes - Neuroscience Inspirations for AI</title><link>https://spidermonk7.github.io/post/reading-notes-neuroscience-ai/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/post/reading-notes-neuroscience-ai/</guid><description>&lt;p>I reviewed several materials on cortical computation and noise robustness.&lt;/p>
&lt;p>What stood out:&lt;/p>
&lt;ul>
&lt;li>Biological systems rely on redundancy and local adaptation, not only scale.&lt;/li>
&lt;li>Robustness is often an emergent property of system design, not a single trick.&lt;/li>
&lt;li>Interpretability improves when models are constrained by plausible mechanisms.&lt;/li>
&lt;/ul>
&lt;p>Next step: prototype a small benchmark that compares standard and bio-inspired architectures under perturbations.&lt;/p></description></item><item><title>TradeCraft - Exploring Theory of Mind in LLM Agents' Strategic Decision-Making and Communication</title><link>https://spidermonk7.github.io/ongoing-projects/tradecraft/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/ongoing-projects/tradecraft/</guid><description>&lt;p>LLM agents’ reliance on implicit Theory of Mind (ToM) during strategic decision-making remains debated. We investigate this using a Minecraft-inspired &amp;ldquo;trade-and-craft&amp;rdquo; game requiring goal inference and item exchange. By augmenting agents with explicit ToM scaffolding—where players report multi-order beliefs about opponents—we evaluate the alignment between inferred mental states and behavioral outcomes.&lt;/p>
&lt;!-- Official page: &lt;https://tradecraft26.github.io> -->
&lt;!-- Repository: &lt;https://github.com/spidermonk7/tradecraft> --></description></item><item><title>Task Ability Decomposition and Difficulty Quantification of Visual Tasks for AGI Evaluation</title><link>https://spidermonk7.github.io/publication/journal-article/</link><pubDate>Mon, 27 Oct 2025 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/publication/journal-article/</guid><description>&lt;p>This work represents a significant advance in AGI evaluation methodology by providing the first comprehensive framework for understanding and quantifying visual task difficulty.&lt;/p>
&lt;h2 id="key-contributions">Key Contributions&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Novel Theoretical Framework&lt;/strong>: First exploration of task-ability space structure and its relationship to task difficulty&lt;/li>
&lt;li>&lt;strong>TADDL-V Framework&lt;/strong>: Systematic approach for quantifying difficulty of visual tasks&lt;/li>
&lt;li>&lt;strong>AGI-V70 Benchmark&lt;/strong>: Curated dataset for testing diverse visual abilities&lt;/li>
&lt;li>&lt;strong>Practical Impact&lt;/strong>: Tools and methods that advance the field of AGI evaluation&lt;/li>
&lt;/ul>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Using the visual domain as a starting point, this research addresses a critical gap in AGI evaluation by introducing a methodology to quantify the difficulty levels of composite tasks. This quantification is crucial for conducting a more comprehensive and fine-grained assessment of AGI systems.&lt;/p>
&lt;p>To promote open science and collaborative advancement, the TADDL-V framework and the AGI-V70 benchmark are made freely available to the research community.&lt;/p>
&lt;h2 id="visual-teaser">Visual teaser&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img alt="TADDL-V Framework" srcset="
/publication/journal-article/Figure2_hu16282056926172136217.webp 400w,
/publication/journal-article/Figure2_hu14960457648425997693.webp 760w,
/publication/journal-article/Figure2_hu12456530112026025648.webp 1200w"
src="https://spidermonk7.github.io/publication/journal-article/Figure2_hu16282056926172136217.webp"
width="760"
height="387"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p></description></item><item><title>FAB - Factory of Abstract-style Benchmark</title><link>https://spidermonk7.github.io/project/fab/</link><pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/project/fab/</guid><description>&lt;p>Developed the first fully automated, low-cost benchmark generation framework for abstract-style evaluation across general-purpose domains. Enables scalable testing of large language models using structured abstraction errors, covering semantic, structural, and factual variants.
Repository: &lt;a href="https://github.com/spidermonk7/FAB-Benchmark" target="_blank" rel="noopener">https://github.com/spidermonk7/FAB-Benchmark&lt;/a>&lt;/p></description></item><item><title>Projects</title><link>https://spidermonk7.github.io/projects/</link><pubDate>Sun, 19 May 2024 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/projects/</guid><description/></item><item><title>Experience</title><link>https://spidermonk7.github.io/experience/</link><pubDate>Tue, 24 Oct 2023 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/experience/</guid><description/></item><item><title>IJCAI Chinese Standard Mahjong AI Competition</title><link>https://spidermonk7.github.io/award/ijcai-mahjong/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/award/ijcai-mahjong/</guid><description>&lt;p>Achieved outstanding performance in the prestigious IJCAI Chinese Standard Mahjong AI Competition across two consecutive years:&lt;/p>
&lt;p>&lt;strong>2023&lt;/strong>: &lt;strong>7th Place&lt;/strong> - IJCAI2023 Special Track
&lt;strong>2022&lt;/strong>: &lt;strong>10th Place&lt;/strong> - IJCAI2022 Competition&lt;/p>
&lt;h2 id="technical-approach">Technical Approach&lt;/h2>
&lt;p>Our solution applied Deep Residual Networks to Mahjong strategy learning:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Architecture&lt;/strong>: Implemented CNNs based on ResNet, Res2Net, and DenseNet architectures&lt;/li>
&lt;li>&lt;strong>Learning Method&lt;/strong>: Used supervised learning algorithms for feature extraction and strategy learning&lt;/li>
&lt;li>&lt;strong>Training Data&lt;/strong>: Leveraged large amounts of expert game-play data&lt;/li>
&lt;li>&lt;strong>Performance&lt;/strong>: Demonstrated consistent improvement and competitive ranking&lt;/li>
&lt;/ul>
&lt;h2 id="recognition">Recognition&lt;/h2>
&lt;ul>
&lt;li>Invited to present at the IJCAI2023 special track&lt;/li>
&lt;li>Recognized among top performers in international AI gaming competition&lt;/li>
&lt;li>Contributed to advancing AI applications in complex strategic games&lt;/li>
&lt;/ul>
&lt;p>This achievement demonstrates proficiency in applying state-of-the-art deep learning techniques to complex strategic decision-making problems and competitive performance in international AI competitions.&lt;/p></description></item><item><title>Possible Models of Self-Awareness in Conscious Turing Machines</title><link>https://spidermonk7.github.io/event/ijtcs2023-ctm/</link><pubDate>Sun, 15 Jan 2023 14:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/event/ijtcs2023-ctm/</guid><description>&lt;p>This invited talk presented our research on consciousness and self-awareness models within the Conscious Turing Machine (CTM) framework. We explored fundamental questions about machine consciousness and proposed the &amp;ldquo;MIT&amp;rdquo; model to understand consciousness generation phenomena in computational systems.&lt;/p>
&lt;p>The presentation covered:&lt;/p>
&lt;ul>
&lt;li>Clear definitions of consciousness and self-consciousness in CTM&lt;/li>
&lt;li>Analysis of how the CTM framework aligns with human consciousness theories&lt;/li>
&lt;li>The &amp;ldquo;MIT&amp;rdquo; model for understanding (or possibly explaining) illusion and disorder phenomena&lt;/li>
&lt;li>Implications for artificial general intelligence development&lt;/li>
&lt;/ul>
&lt;p>This work contributes to the theoretical foundations of machine consciousness and provides insights into the computational basis of awareness.&lt;/p></description></item><item><title>Computational Robustness of Tall Pyramidal Cells</title><link>https://spidermonk7.github.io/project/neuronal-robustness/</link><pubDate>Thu, 01 Sep 2022 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/project/neuronal-robustness/</guid><description>&lt;p>Conducted independent exploration on the robustness of L5PC neuronal cells&amp;rsquo; response to noise of different distributions. Proposed new perspectives on the differences between biological intelligence and mainstream artificial intelligence paradigms.
Repository: &lt;a href="https://github.com/spidermonk7/neuronal-robustness" target="_blank" rel="noopener">https://github.com/spidermonk7/neuronal-robustness&lt;/a>&lt;/p></description></item><item><title>Heliox - A GPU-Native framework for the simulation and training of biophsically detailed nueral networks</title><link>https://spidermonk7.github.io/project/heliox/</link><pubDate>Thu, 01 Sep 2022 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/project/heliox/</guid><description>&lt;p>Based on the framework of Conscious Turing Machines (CTM), discussed consciousness and self-consciousness of a CTM with clear definitions. Examined whether the CTM framework aligns with key definitions of human consciousness and created the &amp;lsquo;MIT&amp;rsquo; model to understand consciousness generation phenomena. Invited to present at IJTCS2023 workshop.
Repository: &lt;a href="https://github.com/spidermonk7/conscious-turing-machine" target="_blank" rel="noopener">https://github.com/spidermonk7/conscious-turing-machine&lt;/a>&lt;/p></description></item><item><title>Research Gym</title><link>https://spidermonk7.github.io/project/reseachgym/</link><pubDate>Thu, 01 Sep 2022 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/project/reseachgym/</guid><description>&lt;p>Based on the framework of Conscious Turing Machines (CTM), discussed consciousness and self-consciousness of a CTM with clear definitions. Examined whether the CTM framework aligns with key definitions of human consciousness and created the &amp;lsquo;MIT&amp;rsquo; model to understand consciousness generation phenomena. Invited to present at IJTCS2023 workshop.
Repository: &lt;a href="https://github.com/spidermonk7/conscious-turing-machine" target="_blank" rel="noopener">https://github.com/spidermonk7/conscious-turing-machine&lt;/a>&lt;/p></description></item><item><title>Self-Awareness in Conscious Turing Machines</title><link>https://spidermonk7.github.io/project/ctm/</link><pubDate>Thu, 01 Sep 2022 00:00:00 +0000</pubDate><guid>https://spidermonk7.github.io/project/ctm/</guid><description>&lt;p>Based on the framework of Conscious Turing Machines (CTM), discussed consciousness and self-consciousness of a CTM with clear definitions. Examined whether the CTM framework aligns with key definitions of human consciousness and created the &amp;lsquo;MIT&amp;rsquo; model to understand consciousness generation phenomena. Invited to present at IJTCS2023 workshop.
Repository: &lt;a href="https://github.com/spidermonk7/conscious-turing-machine" target="_blank" rel="noopener">https://github.com/spidermonk7/conscious-turing-machine&lt;/a>&lt;/p></description></item></channel></rss>