readerwriterqueue: a fast lock-free queue implemented in C++

https://www.oschina.net/translate/a-fast-lock-free-queue-for-cpp?cmp&p=2

A single-producer, single-consumer lock-free queue for C++

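What good is a design without a solid (tested) implementation? :-)

I've released my implementation on GitHub. Feel free to fork it! It consists of two headers: one for the queue itself, and one it depends on which contains some helpers.

It has several nice characteristics: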

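Compatible with C++11 (supports moving objects instead of making copies)
Fully generic (a templated container of any type); just like std::queue, you never need to allocate memory for elements yourself (which saves you the hassle of writing a lock-free memory manager to hold the elements you're queueing)
Allocates memory up front, in contiguous blocks
Provides a try_enqueue method which is guaranteed never to allocate memory (the queue starts with an initial capacity)
Also provides an enqueue method which can dynamically grow the size of the queue as needed
Doesn't use a compare-and-swap loop; this means enqueue and dequeue are O(1) (not counting memory allocation)
On x86, the memory barriers compile down to no-ops, which means enqueue and dequeue are just a simple series of loads and stores (and branches)
Compiles under MSVC2010+ and GCC 4.7+ (and should work on any C++11 compliant compiler)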
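
To make the "no compare-and-swap loop, just loads and stores" point concrete, here is a minimal sketch of a bounded single-producer, single-consumer ring buffer. It is not the readerwriterqueue implementation (which uses growable contiguous blocks); the names SpscRing and spsc_ring_demo are hypothetical, and the sketch assumes T is default-constructible and movable. Each index is written by exactly one thread, so acquire loads and release stores are enough, and on x86 those are ordinary loads and stores.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration only: a fixed-capacity single-producer,
// single-consumer ring buffer. NOT the readerwriterqueue implementation.
template <typename T>
class SpscRing {
public:
    // One slot is always left empty so that "full" and "empty" differ.
    explicit SpscRing(std::size_t capacity)
        : slots_(capacity + 1), head_(0), tail_(0) {}

    // Producer thread only. Never allocates; returns false when full.
    bool try_enqueue(T value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % slots_.size();
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // queue is full
        }
        slots_[tail] = std::move(value);
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }

    // Consumer thread only. Returns false when empty.
    bool try_dequeue(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // queue is empty
        }
        out = std::move(slots_[head]);
        head_.store((head + 1) % slots_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> slots_;           // allocated once, up front
    std::atomic<std::size_t> head_;  // written by the consumer only
    std::atomic<std::size_t> tail_;  // written by the producer only
};

// Single-threaded walk-through of the two operations.
inline void spsc_ring_demo() {
    SpscRing<int> q(2);
    assert(q.try_enqueue(1));
    assert(q.try_enqueue(2));
    assert(!q.try_enqueue(3));  // full: fails instead of allocating
    int v = 0;
    assert(q.try_dequeue(v) && v == 1);
}
```

Both operations run a fixed number of steps with no retry loop, which is where the O(1) bound comes from. An enqueue that grows the queue would need to allocate a new block when try_enqueue fails; that part is left out of this sketch.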

It should be noted that this code will only work on CPUs that treat aligned integer and native-pointer-size loads/stores atomically; fortunately, this includes every modern processor (including ARM, x86/x86-64, and PowerPC). It is not designed to work on the DEC Alpha (which seems to have the weakest memory-ordering guarantees of all time).

I’m releasing the code and algorithm under the terms of the simplified BSD license. Use it at your own risk; in particular, lock-free programming is a patent minefield, and this code may very well violate a pending patent (I haven’t looked). It’s worth noting that I came up with the algorithm and implementation from scratch, independent of any existing lock-free queues.

Performance and correctness

In addition to agonizing over the design for quite some time, I tested the algorithm using several billion randomized operations in a simple stability test (on x86). This, of course, helps inspire confidence, but proves nothing about the correctness. In order to ensure it was correct, I also tested using Relacy, which ran all the possible interleavings for a simple test which turned up no errors; it turns out this simple test wasn’t comprehensive, however, since I eventually did find a bug later using a different set of randomized runs (which I then fixed).

I’ve only tested this queue on x86-64, which is rather forgiving as memory ordering goes. If somebody is willing to test this code on another architecture, let me know! The quick stability test I whipped up is available here.
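
For readers who want to try something similar, the following is a rough sketch of what a randomized two-thread stability test can look like. It is not the stability test mentioned above: IntRing is a hypothetical stand-in queue so that the example compiles on its own, and run_stability_test, its parameters, and the yield probability are all assumptions. As with any such test, a passing run inspires confidence but proves nothing about correctness.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-in queue (not readerwriterqueue itself): a bounded
// single-producer, single-consumer ring of ints.
class IntRing {
public:
    explicit IntRing(std::size_t capacity) : slots_(capacity + 1) {}

    bool try_enqueue(int v) {  // producer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % slots_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool try_dequeue(int& out) {  // consumer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = slots_[head];
        head_.store((head + 1) % slots_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<int> slots_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Pushes 0..count-1 from one thread and pops them from another, with
// randomized yields to vary the interleaving. Returns true if every value
// arrived exactly once and in order.
inline bool run_stability_test(int count, unsigned seed) {
    IntRing q(64);
    std::atomic<bool> ok{true};

    std::thread producer([&] {
        std::mt19937 rng(seed);
        for (int i = 0; i < count; ++i) {
            while (!q.try_enqueue(i)) std::this_thread::yield();  // full
            if (rng() % 16 == 0) std::this_thread::yield();
        }
    });

    std::thread consumer([&] {
        std::mt19937 rng(seed + 1);
        for (int expected = 0; expected < count; ++expected) {
            int v = -1;
            while (!q.try_dequeue(v)) std::this_thread::yield();  // empty
            if (v != expected) ok.store(false);
            if (rng() % 16 == 0) std::this_thread::yield();
        }
    });

    producer.join();
    consumer.join();

    int leftover = 0;
    return ok.load() && !q.try_dequeue(leftover);  // queue must end empty
}
```

Calling run_stability_test(100000, 42u) and checking that it returns true is one way to use it; running it with many seeds and under different compilers and architectures gives broader coverage than a single run.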

In terms of performance, it’s fast. Really fast. In my tests, I was able to get up to about 12+ million concurrent enqueue/dequeue pairs per second! (The dequeue thread had to wait for the enqueue thread to catch up if there was nothing in the queue.) After I had implemented my queue, though, I found another single-consumer, single-producer templated queue (written by the author of Relacy) published on Intel’s website; his queue is roughly twice as fast, though it doesn’t have all the features that mine does, and his only works on x86 (and, at this scale, “twice as fast” means the difference in enqueue/dequeue time is in the nanosecond range).

Update (16 days ago)

I spent some time properly benchmarking, profiling, and optimizing the code, using Dmitry’s single-producer, single-consumer lock-free queue (published on Intel’s website) as a baseline for comparison. Mine’s now faster in general, particularly when it comes to enqueueing many elements (mine uses a contiguous block instead of separate linked elements). Note that different compilers give different results, and even the same compiler on different hardware yields significant speed variations. The 64-bit version is generally faster than the 32-bit one, and for some reason my queue is much faster under GCC on a Linode. Here are the benchmark results in full:

32-bit, MSVC2010, on AMD C-50 @ 1GHz
------------------------------------
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0039s | 0.0268s | 0.0040s | 0.0271s | 0.0040s | 0.0270s | 6.8x
Raw remove        | 0.0015s | 0.0017s | 0.0015s | 0.0018s | 0.0015s | 0.0017s | 1.2x
Raw empty remove  | 0.0048s | 0.0027s | 0.0049s | 0.0027s | 0.0048s | 0.0027s | 0.6x
Single-threaded   | 0.0181s | 0.0172s | 0.0183s | 0.0173s | 0.0182s | 0.0173s | 0.9x
Mostly add        | 0.0243s | 0.0326s | 0.0245s | 0.0329s | 0.0244s | 0.0327s | 1.3x
Mostly remove     | 0.0240s | 0.0274s | 0.0242s | 0.0277s | 0.0241s | 0.0276s | 1.1x
Heavy concurrent  | 0.0164s | 0.0309s | 0.0349s | 0.0352s | 0.0236s | 0.0334s | 1.4x
Random concurrent | 0.1488s | 0.1509s | 0.1500s | 0.1522s | 0.1496s | 0.1517s | 1.0x

Average ops/s:
    ReaderWriterQueue: 23.45 million
    SPSC queue:        28.10 million

64-bit, MSVC2010, on AMD C-50 @ 1GHz
------------------------------------
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0022s | 0.0210s | 0.0022s | 0.0211s | 0.0022s | 0.0211s | 9.6x
Raw remove        | 0.0011s | 0.0022s | 0.0011s | 0.0023s | 0.0011s | 0.0022s | 2.0x
Raw empty remove  | 0.0039s | 0.0024s | 0.0039s | 0.0024s | 0.0039s | 0.0024s | 0.6x
Single-threaded   | 0.0060s | 0.0054s | 0.0061s | 0.0054s | 0.0061s | 0.0054s | 0.9x
Mostly add        | 0.0080s | 0.0259s | 0.0081s | 0.0263s | 0.0080s | 0.0261s | 3.3x
Mostly remove     | 0.0092s | 0.0109s | 0.0093s | 0.0110s | 0.0093s | 0.0109s | 1.2x
Heavy concurrent  | 0.0150s | 0.0175s | 0.0181s | 0.0200s | 0.0165s | 0.0190s | 1.2x
Random concurrent | 0.0367s | 0.0349s | 0.0369s | 0.0352s | 0.0368s | 0.0350s | 1.0x

Average ops/s:
    ReaderWriterQueue: 34.90 million
    SPSC queue:        32.50 million

32-bit, MSVC2010, on Intel Core 2 Duo T6500 @ 2.1GHz
----------------------------------------------------
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0011s | 0.0097s | 0.0011s | 0.0099s | 0.0011s | 0.0098s | 9.2x
Raw remove        | 0.0005s | 0.0006s | 0.0005s | 0.0006s | 0.0005s | 0.0006s | 1.1x
Raw empty remove  | 0.0018s | 0.0011s | 0.0019s | 0.0011s | 0.0018s | 0.0011s | 0.6x
Single-threaded   | 0.0047s | 0.0040s | 0.0047s | 0.0040s | 0.0047s | 0.0040s | 0.9x
Mostly add        | 0.0052s | 0.0114s | 0.0053s | 0.0116s | 0.0053s | 0.0115s | 2.2x
Mostly remove     | 0.0055s | 0.0067s | 0.0056s | 0.0068s | 0.0055s | 0.0068s | 1.2x
Heavy concurrent  | 0.0044s | 0.0089s | 0.0075s | 0.0128s | 0.0066s | 0.0107s | 1.6x
Random concurrent | 0.0294s | 0.0306s | 0.0295s | 0.0312s | 0.0294s | 0.0310s | 1.1x

Average ops/s:
    ReaderWriterQueue: 71.18 million
    SPSC queue:        61.02 million

64-bit, MSVC2010, on Intel Core 2 Duo T6500 @ 2.1GHz
----------------------------------------------------
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0007s | 0.0097s | 0.0007s | 0.0100s | 0.0007s | 0.0099s | 13.6x
Raw remove        | 0.0004s | 0.0015s | 0.0004s | 0.0020s | 0.0004s | 0.0018s | 4.6x
Raw empty remove  | 0.0014s | 0.0010s | 0.0014s | 0.0010s | 0.0014s | 0.0010s | 0.7x
Single-threaded   | 0.0024s | 0.0022s | 0.0024s | 0.0022s | 0.0024s | 0.0022s | 0.9x
Mostly add        | 0.0031s | 0.0112s | 0.0031s | 0.0115s | 0.0031s | 0.0114s | 3.7x
Mostly remove     | 0.0033s | 0.0041s | 0.0033s | 0.0041s | 0.0033s | 0.0041s | 1.2x
Heavy concurrent  | 0.0042s | 0.0035s | 0.0067s | 0.0039s | 0.0054s | 0.0038s | 0.7x
Random concurrent | 0.0142s | 0.0141s | 0.0145s | 0.0144s | 0.0143s | 0.0142s | 1.0x

Average ops/s:
    ReaderWriterQueue: 101.21 million
    SPSC queue:        71.42 million

32-bit, Intel ICC 13, on Intel Core 2 Duo T6500 @ 2.1GHz
--------------------------------------------------------
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0014s | 0.0095s | 0.0014s | 0.0097s | 0.0014s | 0.0096s | 6.8x
Raw remove        | 0.0007s | 0.0006s | 0.0007s | 0.0007s | 0.0007s | 0.0006s | 0.9x
Raw empty remove  | 0.0028s | 0.0013s | 0.0028s | 0.0018s | 0.0028s | 0.0015s | 0.5x
Single-threaded   | 0.0039s | 0.0033s | 0.0039s | 0.0033s | 0.0039s | 0.0033s | 0.8x
Mostly add        | 0.0049s | 0.0113s | 0.0050s | 0.0116s | 0.0050s | 0.0115s | 2.3x
Mostly remove     | 0.0051s | 0.0061s | 0.0051s | 0.0062s | 0.0051s | 0.0061s | 1.2x
Heavy concurrent  | 0.0066s | 0.0036s | 0.0084s | 0.0039s | 0.0076s | 0.0038s | 0.5x
Random concurrent | 0.0291s | 0.0282s | 0.0294s | 0.0287s | 0.0292s | 0.0286s | 1.0x

Average ops/s:
    ReaderWriterQueue: 55.65 million
    SPSC queue:        63.72 million

64-bit, Intel ICC 13, on Intel Core 2 Duo T6500 @ 2.1GHz
--------------------------------------------------------
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0010s | 0.0099s | 0.0010s | 0.0100s | 0.0010s | 0.0099s | 9.8x
Raw remove        | 0.0006s | 0.0015s | 0.0006s | 0.0018s | 0.0006s | 0.0017s | 2.7x
Raw empty remove  | 0.0024s | 0.0016s | 0.0024s | 0.0016s | 0.0024s | 0.0016s | 0.7x
Single-threaded   | 0.0026s | 0.0023s | 0.0026s | 0.0023s | 0.0026s | 0.0023s | 0.9x
Mostly add        | 0.0032s | 0.0114s | 0.0032s | 0.0118s | 0.0032s | 0.0116s | 3.6x
Mostly remove     | 0.0037s | 0.0042s | 0.0037s | 0.0044s | 0.0037s | 0.0044s | 1.2x
Heavy concurrent  | 0.0060s | 0.0092s | 0.0088s | 0.0096s | 0.0077s | 0.0095s | 1.2x
Random concurrent | 0.0168s | 0.0166s | 0.0168s | 0.0168s | 0.0168s | 0.0167s | 1.0x

Average ops/s:
    ReaderWriterQueue: 68.45 million
    SPSC queue:        50.75 million

64-bit, GCC 4.7.2, on Linode 1GB virtual machine (Intel Xeon L5520 @ 2.27GHz)
-----------------------------------------------------------------------------
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
------------------+---------+---------+---------+---------+---------+---------+------
Raw add           | 0.0004s | 0.0055s | 0.0005s | 0.0055s | 0.0005s | 0.0055s | 12.1x
Raw remove        | 0.0004s | 0.0030s | 0.0004s | 0.0030s | 0.0004s | 0.0030s | 8.4x
Raw empty remove  | 0.0009s | 0.0060s | 0.0010s | 0.0061s | 0.0009s | 0.0060s | 6.4x
Single-threaded   | 0.0034s | 0.0052s | 0.0034s | 0.0052s | 0.0034s | 0.0052s | 1.5x
Mostly add        | 0.0042s | 0.0096s | 0.0042s | 0.0106s | 0.0042s | 0.0103s | 2.5x
Mostly remove     | 0.0042s | 0.0057s | 0.0042s | 0.0058s | 0.0042s | 0.0058s | 1.4x
Heavy concurrent  | 0.0030s | 0.0164s | 0.0036s | 0.0216s | 0.0032s | 0.0188s | 5.8x
Random concurrent | 0.0256s | 0.0282s | 0.0257s | 0.0290s | 0.0257s | 0.0287s | 1.1x

Average ops/s:
    ReaderWriterQueue: 137.88 million
    SPSC queue:        24.34 million

In short, my queue is blazingly fast, and actually doing anything with it will eclipse the overhead of the data structure itself.

The benchmarking code is available here (compile and run with full optimizations).
