高效解析反應式緩衝區串流

工程 | Arjen Poutsma | 2021 年 9 月 14 日 | ...

自 Spring Framework 5.3 發佈以來已有一段時間。該版本的功能之一是對我們的反應式 Multipart 支援進行了重大改進。在這篇部落格文章中，我們將分享在開發此功能時學到的一些知識。具體來說，我們將重點放在如何在位元組緩衝區串流中尋找權杖 (token)。

Multipart 表單資料

每當您上傳檔案時，您的瀏覽器會將其 — 以及表單中的其他欄位 — 以 multipart/form-data 訊息的形式傳送到伺服器。這些訊息的確切格式在 RFC 7578 中有描述。如果您提交一個簡單的表單，其中包含一個名為 foo 的文字欄位和一個名為 file 的檔案選擇器，則 multipart/form-data 訊息看起來會像這樣

POST / HTTP/1.1
Host: example.com
Content-Type: multipart/form-data;boundary="boundary" (1)

--boundary (2)
Content-Disposition: form-data; name="foo" (3)

bar
--boundary (4)
Content-Disposition: form-data; name="file"; filename="lorum.txt" (5)
Content-Type: text/plain

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer iaculis metus id vestibulum nullam.

--boundary-- (6)

訊息的 Content-Type 標頭包含 boundary 參數。
boundary 用於啟動第一個部分。它前面會加上 --。
第一個部分包含文字欄位 foo 的值，如部分標頭中所見。欄位的值為 bar。
boundary 用於分隔第一個和第二個部分。同樣地，它前面會加上 --。
第二個部分包含提交的檔案內容，名為 lorum.txt。
訊息的結尾以 boundary 表示。它的前面和後面都加上 --。

尋找 Boundaries

multipart/form-data 訊息中的 boundary 非常重要。它被指定為 Content-Type 標頭的參數。當 boundary 前面加上兩個連字符號 (--) 時，表示新部分的開始。當 boundary 後面也加上 -- 時，表示訊息的結尾。

在傳入的位元組緩衝區串流中尋找 boundary 是解析 multipart 訊息的關鍵。這樣做似乎很簡單

private int indexOf(DataBuffer source, byte[] target) {
  int max = source.readableByteCount() - target.length + 1;
  for (int i = 0; i < max; i++) {
    boolean found = true;
    for (int j = 0; j < target.length; j++) {
      if (source.getByte(i + j) != target[j]) {
        found = false;
        break;
      }
    }
    if (found) {
      return i;
    }
  }
  return -1;
}

但是，有一個複雜之處：boundary 可能會跨越兩個緩衝區，而在反應式環境中，這兩個緩衝區可能不會同時到達。例如，給定前面顯示的範例 multipart 訊息，第一個緩衝區可能包含以下內容

POST / HTTP/1.1
Host: example.com
Content-Type: multipart/form-data;boundary="boundary"

--boundary
Content-Disposition: form-data; name="foo"

bar
--bou

而下一個緩衝區包含剩餘部分

ndary
Content-Disposition: form-data; name="file"; filename="lorum.txt"
Content-Type: text/plain

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer iaculis metus id vestibulum nullam.

--boundary--

如果我們一次檢查一個緩衝區，我們就找不到像這樣的分割 boundary。相反地，我們需要跨多個緩衝區尋找 boundary。

解決此問題的一種方法是等待接收到所有緩衝區，將它們連接起來，然後在之後定位 boundaries。以下範例使用範例串流和先前定義的 indexOf 方法來執行此操作

Flux<DataBuffer> stream = Flux.just("foo", "bar", "--boun", "dary", "baz")
  .map(s -> factory.wrap(s.getBytes(UTF_8)));
byte[] boundary = "--boundary".getBytes(UTF_8);

Mono<Integer> result = DataBufferUtils.join(stream)
  .map(joined -> indexOf(joined, boundary));

StepVerifier.create(result)
  .expectNext(6)
  .verifyComplete();

使用 Reactor 的 StepVerifier，我們看到 boundary 從索引 6 開始。

這種方法有一個主要的缺點：將多個緩衝區連接成一個有效地將整個 multipart 訊息儲存在記憶體中。Multipart 訊息主要用於上傳（大型）檔案，因此這不是一個可行的選擇。相反地，我們需要一種更聰明的方法來定位 boundary。

Knuth 來救援！

幸運的是，存在這樣一種方法，即 Knuth-Morris-Pratt 演算法。此演算法背後的主要思想是，如果我們已經比對了 boundary 的幾個位元組，但下一個位元組不匹配，我們不需要從頭重新開始。為此，該演算法維護狀態，以預先計算的表格中的位置形式，該表格包含在不匹配後可以跳過的位元組數。

在 Spring Framework 中，我們在 Matcher 介面中實作了 Knuth-Morris-Pratt 演算法，您可以透過 DataBufferUtils::matcher 取得其執行個體。您也可以查看原始碼。

在這裡，我們使用 Matcher 來提供 stream 中 boundary 的結束索引，使用與先前相同的範例輸入

Flux<DataBuffer> stream = Flux.just("foo", "bar", "--boun", "dary", "baz")
  .map(s -> factory.wrap(s.getBytes(UTF_8)));
byte[] boundary = "--boundary".getBytes(UTF_8);

DataBufferUtils.Matcher matcher = DataBufferUtils.matcher(boundary);
Flux<Integer> result = stream.map(matcher::match);

StepVerifier.create(result)
  .expectNext(-1)
  .expectNext(-1)
  .expectNext(-1)
  .expectNext(3)
  .expectNext(-1)
  .verifyComplete();

請注意，Knuth-Morris-Pratt 演算法給出 boundary 的結束索引，這解釋了測試結果：boundary 直到倒數第二個緩衝區中的索引 3 才結束。

正如預期的那樣，Spring Framework 的 MultipartParser 大量使用 Matcher，用於

尋找第一個 boundary，方法是尋找以 -- 為前綴的 boundary。
尋找後續的 boundaries，方法是尋找以 CRLF（前一個部分的結尾）和 -- 為前綴的 boundary。
尋找部分標頭和部分主體之間的 CRLF CRLF 分隔符號。

如果您需要在位元組緩衝區串流中尋找一系列位元組，請試試看 Matcher！

Spring 部落格

高效解析反應式緩衝區串流

Multipart 表單資料

尋找 Boundaries

Knuth 來救援！

取得 Spring 電子報

領先一步

取得支援

即將到來的活動