TL;DR
The development of parallelism in computers can be broadly divided into three levels: instruction-level parallelism, data-level parallelism, and task-level parallelism.
The first level of parallelism is instruction-level parallelism (ILP). ILP was the primary means of improving performance in computer architectures during the last two decades of the 20th century. Programmers particularly appreciate ILP because it improves performance while preserving binary compatibility. ILP comes in two forms: temporal parallelism, realized through instruction pipelining, and spatial parallelism, achieved through multiple issue (superscalar execution). Pipelining is analogous to an automobile assembly line, where several cars are in production simultaneously at different stages. Multiple issue lets several instructions be issued in the same cycle, often combined with out-of-order execution, much as multiple lanes on a highway let cars travel side by side. After the introduction of RISC architectures in the 1980s, ILP development peaked over the following two decades, but further gains have been limited since around 2010.
The second level of parallelism is data-level parallelism (DLP), which primarily refers to single instruction, multiple data (SIMD) vector architectures. Early DLP appeared in array machines such as the ILLIAC IV, and vector machines like the Cray-1 and Cray-2 were popular in the 1970s and 1980s. After the Cray-4, vector architectures went through a period of dormancy, but SIMD has recently regained momentum and is seeing increasing adoption. For example, AVX instructions in x86 can perform four 64-bit or eight 32-bit operations using a 256-bit datapath. SIMD has played a crucial role as a complement to ILP, especially in streaming media applications. While initially confined to specialized processors, SIMD is now a standard feature of general-purpose processors.
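To make the 256-bit datapath concrete, here is a minimal C++ sketch that adds four 64-bit doubles with a single AVX instruction; the analogous `_mm256_add_ps` intrinsic operates on eight 32-bit floats. It assumes an x86 CPU with AVX and a compiler flag such as `-mavx`; the example itself is illustrative, not taken from any particular codebase.

```cpp
#include <immintrin.h>   // AVX intrinsics (compile with -mavx)
#include <cstdio>

int main() {
    // Four 64-bit doubles fit into one 256-bit AVX register each.
    alignas(32) double a[4] = {1.0, 2.0, 3.0, 4.0};
    alignas(32) double b[4] = {10.0, 20.0, 30.0, 40.0};
    alignas(32) double c[4];

    __m256d va = _mm256_load_pd(a);       // load four doubles at once
    __m256d vb = _mm256_load_pd(b);
    __m256d vc = _mm256_add_pd(va, vb);   // one instruction, four additions
    _mm256_store_pd(c, vc);

    for (double x : c) std::printf("%g ", x);   // prints: 11 22 33 44
    std::printf("\n");
    return 0;
}
```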
Task-Level Parallelism
The third level, task-level parallelism, is prevalent in Internet applications. It is embodied in multi-core and multi-threaded processors, which are now the primary means of improving architectural performance. Task-level parallelism has a much larger grain size, with a single thread containing hundreds of instructions or more.

From the perspective of the development of parallel computing, blockchain technology is currently transitioning from the first level to the second. Mainstream blockchain systems typically employ either a single-chain or a multi-chain architecture. In single-chain systems such as Bitcoin and Ethereum, every node executes the same smart contract transactions and maintains an identical chain state. Each node typically executes these transactions serially, resulting in low throughput.

Recent high-performance blockchain systems, while still single-chain, also support parallel execution of smart contract transactions. Thomas Dickerson and Maurice Herlihy of Brown University, together with collaborators at Yale University, first proposed a parallel execution model based on Software Transactional Memory (STM) in their 2017 PODC paper. This model uses optimistic concurrency to execute multiple transactions in parallel, detecting conflicts as they arise and rolling back the affected transactions. The approach has been adopted by several high-performance blockchain projects, including Aptos, Sei, and Monad.

In contrast, the other parallel execution model is based on pessimistic concurrency: transactions are executed in parallel only after the system has determined that they do not conflict. This approach typically relies on pre-computation, using program-analysis tools to statically analyze smart contract code and build dependency graphs. When concurrent transactions are submitted, the system decides whether they can run in parallel based on the dependencies between the states they access; only transactions without dependencies are executed concurrently. This approach has been applied in high-performance blockchain projects such as Zilliqa (the CoSplit version) and Sui.

Both models can significantly improve system throughput, but they face two challenges, scalability and the expression of parallel semantics, which will be discussed in detail below. Simplified sketches of the two execution models follow.
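The optimistic model can be illustrated with a minimal, self-contained C++ sketch: transactions run speculatively against a versioned key-value store, record their read set, buffer their writes, and validate the read set at commit time, retrying on conflict. The names here (`VersionedStore`, `TxContext`, `run_optimistic`) are assumptions for illustration only, not the actual design of the PODC paper or of Aptos, Sei, or Monad.

```cpp
#include <cstdio>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Entry { long value = 0; long version = 0; };

class VersionedStore {
public:
    // Speculative read: return the value together with the version observed.
    Entry read(const std::string& k) {
        std::lock_guard<std::mutex> g(m_);
        return data_[k];
    }
    // Atomic validate-and-apply: commit only if every key read still has the
    // version observed during speculative execution; otherwise report a conflict.
    bool commit(const std::unordered_map<std::string, long>& reads,
                const std::unordered_map<std::string, long>& writes) {
        std::lock_guard<std::mutex> g(m_);
        for (const auto& [k, ver] : reads)
            if (data_[k].version != ver) return false;   // conflict detected
        for (const auto& [k, v] : writes) {
            data_[k].value = v;
            data_[k].version++;
        }
        return true;
    }
private:
    std::mutex m_;
    std::unordered_map<std::string, Entry> data_;
};

// The transaction context records the read set (key -> observed version)
// and buffers writes until commit, as in a software transactional memory.
struct TxContext {
    VersionedStore& store;
    std::unordered_map<std::string, long> read_set;
    std::unordered_map<std::string, long> write_set;

    long get(const std::string& k) {
        if (auto it = write_set.find(k); it != write_set.end()) return it->second;
        Entry e = store.read(k);
        read_set.emplace(k, e.version);
        return e.value;
    }
    void put(const std::string& k, long v) { write_set[k] = v; }
};

using Tx = std::function<void(TxContext&)>;

// Run every transaction on its own worker thread; a transaction that fails
// validation discards its buffered writes and re-executes from scratch.
void run_optimistic(VersionedStore& store, const std::vector<Tx>& txs) {
    std::vector<std::thread> workers;
    for (const Tx& tx : txs) {
        workers.emplace_back([&store, &tx] {
            while (true) {
                TxContext ctx{store};
                tx(ctx);                                        // speculative execution
                if (store.commit(ctx.read_set, ctx.write_set))  // validate + apply
                    break;
            }
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    VersionedStore store;
    std::vector<Tx> txs;
    // Ten increments of the same account: heavily conflicting, so many
    // speculative runs are rolled back and retried.
    for (int i = 0; i < 10; ++i)
        txs.push_back([](TxContext& ctx) { ctx.put("A", ctx.get("A") + 1); });
    run_optimistic(store, txs);
    std::printf("A = %ld\n", store.read("A").value);   // always prints A = 10
    return 0;
}
```

The key property is that correctness does not depend on predicting conflicts in advance; heavily conflicting workloads simply pay for wasted speculative work.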
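The pessimistic model can be sketched in the same spirit: each transaction declares (or a static analyzer derives) the state keys it may touch, a scheduler groups transactions with pairwise disjoint key sets into the same batch, and only transactions within a batch run in parallel. The names (`DeclaredTx`, `schedule_batches`, `execute`) are again illustrative assumptions and do not reflect the internals of CoSplit or Sui.

```cpp
#include <cstdio>
#include <functional>
#include <set>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// A transaction together with the state keys it is known (by declaration or by
// static analysis of the contract code) to read or write.
struct DeclaredTx {
    std::set<std::string> keys;
    std::function<void(std::unordered_map<std::string, long>&)> body;
};

// Greedy batching over the dependency graph: a transaction joins an existing
// batch only if its key set is disjoint from every key already claimed by that
// batch; otherwise it opens a new, later batch.
std::vector<std::vector<const DeclaredTx*>>
schedule_batches(const std::vector<DeclaredTx>& txs) {
    std::vector<std::vector<const DeclaredTx*>> batches;
    std::vector<std::set<std::string>> claimed;   // keys claimed per batch
    for (const auto& tx : txs) {
        size_t b = 0;
        for (; b < batches.size(); ++b) {
            bool conflict = false;
            for (const auto& k : tx.keys)
                if (claimed[b].count(k)) { conflict = true; break; }
            if (!conflict) break;
        }
        if (b == batches.size()) {
            batches.emplace_back();
            claimed.emplace_back();
        }
        batches[b].push_back(&tx);
        claimed[b].insert(tx.keys.begin(), tx.keys.end());
    }
    return batches;
}

void execute(std::unordered_map<std::string, long>& state,
             const std::vector<DeclaredTx>& txs) {
    // Pre-create every declared key so the map is never resized while a batch
    // runs; concurrent writes then touch distinct, already-existing entries.
    for (const auto& tx : txs)
        for (const auto& k : tx.keys) state.try_emplace(k, 0);

    for (const auto& batch : schedule_batches(txs)) {
        std::vector<std::thread> workers;
        for (const DeclaredTx* tx : batch)
            workers.emplace_back([&state, tx] { tx->body(state); });
        for (auto& w : workers) w.join();   // next batch starts only afterwards
    }
}

int main() {
    std::unordered_map<std::string, long> state;
    std::vector<DeclaredTx> txs = {
        // The first two transfers touch disjoint keys and share a batch;
        // the third overlaps with both and is scheduled into a later batch.
        {{"A", "B"}, [](auto& s) { s.at("A") -= 5; s.at("B") += 5; }},
        {{"C", "D"}, [](auto& s) { s.at("C") -= 3; s.at("D") += 3; }},
        {{"B", "C"}, [](auto& s) { s.at("B") -= 1; s.at("C") += 1; }},
    };
    execute(state, txs);
    std::printf("A=%ld B=%ld C=%ld D=%ld\n",
                state.at("A"), state.at("B"), state.at("C"), state.at("D"));
    return 0;
}
```

Here no work is ever rolled back, but throughput depends entirely on how precisely the access sets can be determined ahead of execution.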