Glossary Terms

3DNow! SIMD instruction set developed by AMD. It enables vector processing to improve the performance of graphics-intensive operations. 3DNow! was AMD's response to Intel's MMX.
adaptive mesh refinement Any meshing technique in which a finer grid is overlaid, perhaps recursively, on regions of a coarse grid. These regions are identified to have insufficient grid resolution according to some criteria, e.g., there are steep gradients in the data there. As the main calculation proceeds, a correction may be applied along the coarse-fine grid interfaces in order to preserve the known conservation laws of the system. Often abbreviated AMR.
affinity (UPC) In the Unified Parallel C language, the association between a data item and the UPC thread that "owns" it. Assuming an ideal UPC implementation, affinity means that the data item is stored in the physical memory attached to the CPU on which the UPC thread is running.
alias A feature of many command-line interpreters and programming languages that allows a command or group of commands to be replaced by another (usually shorter) string, or that avoids naming collisions.
allocation An allotment of computing time, disk space, or other types of computing resources, usually granted to the principal investigator (PI), on behalf of the research group that the PI leads. The allocation is typically awarded after peer review and approval of the PI's allocation application.
alpha-beta model A simple linear model for the time required to send a message over a network. The formula is t = a + n*b, where a (alpha) is the latency, n is the message size, and b (beta) is the transfer time per unit of data, i.e., 1/(bandwidth).
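The model above can be sketched as a small function (the function name and example numbers are illustrative, not from any standard library):

```python
def message_time(n, alpha, beta):
    """Alpha-beta model: t = alpha + n * beta.

    alpha: per-message latency (seconds)
    beta:  transfer time per byte, i.e., 1/bandwidth (seconds/byte)
    n:     message size in bytes
    """
    return alpha + n * beta

# Example: 1 microsecond latency, 10 GB/s bandwidth, 1 MB message
t = message_time(1_000_000, alpha=1e-6, beta=1e-10)  # about 101 microseconds
```

Note that for large messages the bandwidth term dominates, while for small messages the latency term dominates; this is why aggregating many small messages into fewer large ones often pays off.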
Amdahl's Law The observation that given a fixed problem size, the potential speedup of a parallel program is limited by the fraction of the calculation that must be carried out in serial. If the serial fraction is (1-P), so that P is the fraction of the work that can be done in parallel, then the maximum possible speedup is S = 1/((1-P) + P/N) where N is the number of parallel threads of execution. In the limit of large N, S -> 1/(1-P).
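The formula can be sketched directly (the function name is illustrative):

```python
def amdahl_speedup(p, n):
    """Maximum speedup for parallel fraction p on n workers:
    S = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# With 90% of the work parallelizable, even 10 workers give only ~5.3x,
# and no number of workers can exceed 1 / (1 - 0.9) = 10x.
s10 = amdahl_speedup(0.9, 10)
```

The example illustrates the law's practical message: the serial fraction, however small, caps the achievable speedup.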
API Application Programming Interface, the interface used by software components to communicate with each other. For example, when a main program calls a subprogram, the calling sequence for the subprogram is the API.
arithmetic logic unit (ALU) A fundamental component of CPUs and GPUs that performs arithmetic and logical operations.
array An indexed collection of variables.
ATLAS Automatically Tuned Linear Algebra Software, a software library for basic numerical linear algebra. It provides a mature, open-source implementation of the BLAS APIs for C and Fortran77. ATLAS is often recommended as a way to automatically generate an optimized BLAS library. Its performance can rival that of specialized libraries for one specific hardware platform, and ATLAS is significantly faster than the generic BLAS available at Netlib.
atomic operation (atomicity) An operation that appears to be indivisible from the point of view of the rest of the system. This prevents other processors or I/O devices from accessing the referenced memory until the atomic operation completes. Atomicity is a guarantee of isolation from concurrent processes, preventing race conditions. (In OpenMP, the "atomic" directive applies only to memory updates by the current thread team.)
automatic speech recognition The translation of spoken words into text via a computer algorithm such as a Hidden Markov Model.
AVX Advanced Vector Extensions. An enhancement to Intel's x86 instruction set, based on the earlier SSE instructions for 128-bit registers, that adds instructions to operate with 256-bit (AVX) registers. AVX also introduces a 3-operand SIMD instruction format such that the destination register is distinct from the two source operands.
AVX-512 Advanced Vector Extensions, 512-bit. Intel's enhancement to the x86 instruction set to expand to 512 bits the earlier AVX and AVX2 instructions for 256-bit vectors, which in turn grew from Intel's SSE instructions for 128-bit vectors.
bandwidth A measure of the sustained rate of information transfer, typically used to quantify the communication capability of computer networks and components. Bandwidth can be used to measure both node-to-node and intra-node communication speeds. It is usually measured in megabytes of data per second (disk drives) or in gigabits per second (networks).
Barnes-Hut algorithm A tree-based algorithm for computing far-field forces on a particle system. The 3D simulation volume is divided up into cubic cells via an octree (or in 2D, into quadrants via a quad-tree), so that only particles from nearby cells need to be treated individually, while particles in distant cells can be treated as one large particle. As compared to a direct-sum algorithm which would be O(N^2), Barnes-Hut requires only O(N log N) operations.
barrier An instruction in the source code that creates a synchronization point. At such a point, any process or thread must stop, proceeding only when all processes/threads have reached the same point.
BBN Butterfly A massively parallel computer built by Bolt, Beranek, and Newman, using the "butterfly" multi-stage switching network. Its memory address space was shared, that is, each connected CPU was allowed access to every other CPU's memory.
Bell's Law An explanation of the evolution of computers in terms of computer classes, which are composed of hardware components along with associated operating systems, application programs, and unique content. The main principles of Bell's Law are that a computer class generally evolves to deliver increasing performance at a constant price; that an established computer class can be entirely subsumed by a more rapidly evolving, powerful, less expensive class; and that approximately every decade, a new computer class arises when a distinct "minimal" computer is built from a fragment of the current chip technology.
benchmark A standardized program (or suite of programs) that is run to measure the performance of one computer against the performance of other computers running the same program.
bidirectional queue In a computer connected to a network, a queue structure that is capable of holding both inbound messages that are to be received locally, and outbound messages that are to be sent to a remote destination.
big ball of mud "A haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems." [quote from Brian Foote and Joseph Yoder, in their 1997 paper of the same name]
big O, O( ) Mathematical notation describing the asymptotic behavior of a function as its argument tends to infinity. In conjunction with its argument, it is pronounced "order", e.g., O(N) is pronounced "order N". In computer science, big O notation is used to describe the asymptotic complexity of an algorithm, i.e., the resources required, as a function of problem size. For example, an algorithm whose run time is O(N) will take time proportional to N for large N; an algorithm whose run time is O(1) takes an approximately constant amount of time independent of problem size.
big Omega Mathematical notation describing an asymptotic lower bound on the order of growth of a function. Pronounced "no less than order" or "order larger than".
bisection bandwidth The aggregate bandwidth across the smallest "cut" that divides an entire network into two equal halves. It is an important network metric for algorithms that require all-to-all communications.
bit-level parallelism Parallelism that derives from the processor's word size. It can be increased by increasing the number of bits that a microprocessor can operate on simultaneously.
blackboard pattern In software architecture, a structural pattern in which a common knowledge base, the "blackboard" or repository, is updated by collaborating agents until a solution is found. In OPL, it is also called the "agent and repository" pattern. Here, a repository consists of a data store plus a data manager to coordinate access to the data store.
BLACS Basic Linear Algebra Communication Subprograms, a linear-algebra-oriented message passing interface. It aims to be both portable and efficient on a wide variety of distributed memory parallel computers.
BLAS Basic Linear Algebra Subprograms, the standard interface for elementary routines in numerical linear algebra. BLAS Levels 1, 2, and 3 refer to vector-vector, matrix-vector and matrix-matrix operations, respectively. Level M also refers to the scaling exponent of naïve algorithms; i.e., at Level M, O(N^M) operations are presumably needed for N-vectors or NxN matrices. Vendors and others provide optimized implementations of the Level 3 routines that approach the hardware's peak performance; fast algorithms such as Strassen's can even reduce the exponent below 3. A few well-known implementations are ATLAS, Intel MKL, and the standard reference code, Netlib BLAS, which is written in Fortran 77.
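To illustrate the Level 3 scaling, here is a minimal sketch of a naïve matrix-matrix multiply (the name gemm_naive is hypothetical; real BLAS implementations are far more heavily optimized):

```python
def gemm_naive(A, B):
    """Naive Level 3 operation (matrix-matrix multiply).
    Three nested loops over N give the O(N^3) cost of the naive algorithm."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]  # hoisted so the inner loop streams through a row
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C
```

Optimized libraries keep the same interface but restructure these loops (blocking for cache, vectorization, threading) to approach peak hardware performance.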
blocking/non-blocking communication Two ways of handling point-to-point communication. Blocking methods suspend the calling program's execution until the message in the buffer is ready to use. Non-blocking methods start the communication but do not suspend the calling program for the entire communication duration. Non-blocking methods must therefore be completed by a matching "wait" (or similar) call at a later point in the calling program.
branch An if-then-else structure in a program. At a branch point, control flow proceeds in one of two directions, depending on whether a given condition is met.
breadth-first search A strategy for searching a graph that begins at a root node and inspects all neighboring nodes, then inspects the neighbors of the neighbors, and so on. Compare with "depth-first search".
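A minimal sketch of the strategy, using a queue to hold the frontier (function and variable names are illustrative):

```python
from collections import deque

def bfs_order(adj, root):
    """Return vertices in breadth-first order starting from root.
    adj maps each vertex to a list of its neighbors."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        v = queue.popleft()          # FIFO queue => level-by-level traversal
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order
```

The FIFO queue is what makes the search breadth-first: all vertices at distance d from the root are visited before any vertex at distance d+1.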
broadcast An operation that takes data from one process and sends it to all processes in the process group.
buffer A region of memory used to hold temporary data when it is being moved to a different place. In MPI, "buffer" can also refer to the application's own region of memory that holds the actual (non-temporary) data during communication.
bulk synchronous parallel (BSP) A model of a parallel computer that takes into account the overhead due to communication and synchronization. A hypothetical upper bound on the overhead is calculated based on the assumptions that (1) the parallel processes must synchronize after a bulk communication step, during which (2) all messages overlap with each other.
burst buffer A high-speed storage layer situated between an application and a persistent file system that quickly absorbs any bursts of output from the application and gradually drains them into the persistent file system. This allows the application to return to computation sooner. A typical burst buffer is built from non-volatile RAM components such as flash memory.
bus Computer subsystem that moves data between the components in the system.
byte A unit of digital information, consisting of eight bits.
cache Fast memory used to hold data which has recently been fetched from the slower and larger main computer memory. There are generally several levels of cache called L1, L2, and L3, where L1 is the fastest (and smallest) level in the cache hierarchy. Each time the processor performs a fetch operation, the hardware checks first to see if the required data are already present in cache at some level, checking L1 first, then L2, etc. If the required data are not in any of the cache levels, a cache miss results, and the data must be retrieved from main memory at the expense of time.
cache coherence The consistency of shared operands in a shared memory system. In a typical shared memory architecture, separate cache memory is maintained for each processor or core. Thus it is possible to have multiple copies of a shared operand stored in the main memory and in each cache memory. When one copy of the operand is changed, cache coherence ensures that this change is propagated promptly to the other copies throughout the system, before any other changes take place.
cache hit The opposite of a cache miss. A cache hit means that the requested data are present in cache memory, so there is no need to retrieve the data from main memory, which would be much slower.
cache line The basic unit of contiguous memory that is fetched from main memory and handled by the various levels of cache. The typical length of a cache line is 64 or 128 bytes.
cache-oblivious algorithm An algorithm designed to take advantage of a processor’s cache without knowing in advance the size of the cache (or the length of the cache lines, etc.). Thus, it is designed to perform well, without modification, on various memory hierarchies having different levels of cache with different sizes. Typically, a cache-oblivious algorithm works by recursive divide and conquer: the problem is divided into smaller and smaller sub-problems until a sub-problem size is reached that will fit into cache, regardless of the cache size. Examples of libraries that exploit cache-oblivious algorithms are FFTW and ATLAS.
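As a sketch of the recursive divide-and-conquer idea, here is a cache-oblivious matrix transpose (names and the tiny base-case threshold are illustrative; a real implementation would use a larger base case and a compiled language):

```python
def co_transpose(A, B, r0, r1, c0, c1):
    """Cache-oblivious transpose: B[j][i] = A[i][j] over rows [r0, r1)
    and columns [c0, c1). The larger dimension is halved recursively,
    so eventually every block fits in cache, whatever the cache size."""
    if (r1 - r0) * (c1 - c0) <= 4:       # tiny base-case block
        for i in range(r0, r1):
            for j in range(c0, c1):
                B[j][i] = A[i][j]
    elif r1 - r0 >= c1 - c0:             # split the row range
        rm = (r0 + r1) // 2
        co_transpose(A, B, r0, rm, c0, c1)
        co_transpose(A, B, rm, r1, c0, c1)
    else:                                # split the column range
        cm = (c0 + c1) // 2
        co_transpose(A, B, r0, r1, c0, cm)
        co_transpose(A, B, r0, r1, cm, c1)
```

No cache parameter appears anywhere in the code; the recursion alone guarantees that some level of subdivision matches every level of the memory hierarchy.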
CAD-CAM Computer-Aided Design and Computer-Aided Manufacturing. The use of computer systems and software to assist in creating a design or model and (possibly) fabricating it.
capacitance The electric charge that can be stored per unit of voltage. It increases as the fixed distance between two oppositely charged objects decreases. Power consumption in a computer chip scales as (C V^2 F), where C is capacitance, V is voltage, and F is cycles per second.
checkpointing Periodically storing a snapshot of the current application state in case the application needs to be restarted later, e.g., after an interruption.
chip-level multiprocessing (CMP) A special case of symmetric multiprocessing (SMP) in which the multiple CPUs or cores are on a single chip.
Cilk C-like programming language designed for multithreaded parallel computing, developed at MIT.
clause (OpenMP) An instruction appended to an OpenMP directive that allows one to express a special attribute, such as the scope of variables within a parallel construct.
clock speed The frequency of a microprocessor in cycles per second (Hz). Boosting the clock speed yields more operations per second, but it also drives up the power consumption of a chip. A clock cycle takes time given by 1 over the frequency in Hz.
cloud computing The use of computer hardware and software delivered as a service over the Internet. In diagrams, a cloud-shaped symbol is often used to represent the network connecting the user to the service.
cluster An architecture consisting of a networked set of nodes functioning as a single resource.
Coarray Fortran An extension of Fortran 95/2003 for parallel processing. (Abbreviation: CAF.)
collective operation A parallel processing step that requires the participation of all processes within a defined group of processes, such as an MPI communicator. The operation can involve communication only, or it can combine communication with computation (e.g., MPI_Reduce).
communication overhead A measure of the additional workload incurred in a concurrent algorithm due to communication between the nodes of the parallel system. If communication is the only source of overhead, then the communication overhead is given by: ((number of processors * parallel run time) - sequential run time) / sequential run time
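The formula above can be sketched as a function (the name is illustrative):

```python
def communication_overhead(seq_time, par_time, nprocs):
    """Overhead fraction when communication is the only source of overhead:
    (nprocs * par_time - seq_time) / seq_time."""
    return (nprocs * par_time - seq_time) / seq_time

# Example: a 100 s sequential job that takes 30 s on 4 processors
# spends (4*30 - 100)/100 = 20% extra work on communication.
oh = communication_overhead(100.0, 30.0, 4)
```

An overhead of 0 corresponds to perfect speedup (nprocs * par_time equals seq_time).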
compiler A computer program which acts on source code and produces code in a different format, particularly machine code specific to the hardware on which the compiled program will run.
compiler flags Guidance given to the compiler about how to compile the code, which may be aimed at optimization of the output code, at code-checking and delivery of warnings, directions to find libraries, etc.
compressed sparse row (CSR) Data format for representing a sparse matrix compactly by storing only the nonzero values and two vectors of indices that describe how to reconstruct the entire matrix. Also called compressed row storage (CRS).
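A minimal sketch of building the three CSR arrays from a dense matrix (the helper name is illustrative):

```python
def to_csr(dense):
    """Convert a dense 2-D list into CSR arrays (values, col_idx, row_ptr).
    row_ptr[i]:row_ptr[i+1] gives the slice of values/col_idx for row i."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)      # nonzero entries, row by row
                col_idx.append(j)     # column of each nonzero
        row_ptr.append(len(values))   # cumulative nonzero count per row
    return values, col_idx, row_ptr
```

Only the nonzeros and two index vectors are stored, so the memory cost is O(nnz + nrows) rather than O(nrows * ncols).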
computational intensity The number of floating point operations per slow-to-fast memory access. A typical definition is the formula q=f/m, where f stands for total number of floating point operations, and m stands for total amount of memory traffic between cache and RAM. This is not a unique definition; it can vary slightly by context. It is often called "arithmetic intensity" (e.g., in GPUs).
concurrency A property of computer systems in which multiple computations (threads, processes, or both) can potentially proceed simultaneously. These computations may interact with each other. Concurrency does not imply that multiple cores are involved; a single core may be time-shared by a concurrent program.
condition variable A memory location that is used for signaling between threads. In Pthreads, a change in the state of the condition variable can tell a sleeping thread to wake up.
conjugate gradient An algorithm for the numerical solution of systems of linear equations that meet certain criteria. It is implemented most often as an iterative method based on the convergence of a residual vector.
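A minimal sketch of the iterative method for a small symmetric positive-definite system, in pure Python (production codes would use an optimized linear algebra library and a preconditioner):

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A (dense 2-D list),
    iterating until the squared residual norm falls below tol."""
    n = len(b)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]
    x = [0.0] * n
    r = b[:]                      # residual b - A x, with x = 0 initially
    p = r[:]                      # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)   # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:          # converged: residual is (nearly) zero
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

In exact arithmetic the method converges in at most n iterations for an n-by-n system; in practice it is used as an iterative method and stopped once the residual is small enough.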
conservative parallelization An approach to parallel computing in which a parallel task computes its results only after all inputs have been received. Deadlock detection is usually necessary.
constellation A computer architecture in which there are more microprocessors within a node than there are nodes in the commodity cluster. (Ref. - Dongarra et al., "High Performance Computing Clusters, Constellations, MPPs, and Future Directions", 2003.)
container A form of virtualization where the host OS kernel is shared among various user space environments that are typically completely isolated from each other. These isolated environments are often called containers, but may also be called by a name bestowed by a particular container technology (e.g. jails, zones, etc.).
contention A situation that occurs when several processes attempt to access the same resource simultaneously. An example is memory contention in shared memory multiprocessors, which occurs when two processors attempt to read from or write to a location in shared memory in the same clock cycle.
convex hull The smallest convex set that contains a given set of points in a Euclidean space (e.g., a convex polygon in a 2D Euclidean plane).
core A processing unit on a computer chip capable of supporting a thread of execution. Usually "core" refers to a physical CPU (central processing unit) in hardware. However, Intel processors can appear to have 2x or 4x as many cores via "hyperthreading" or "hardware threads".
CPU cycle One tick of the clock generator for a CPU core. Depending on the architecture of the core, one or several floating point operations may be completed in each core during a single CPU cycle.
CREW Concurrent Read, Exclusive Write. A parallel memory model in which multiple processors can read simultaneously from a single memory location, but only one processor can write to a particular memory location at one time.
critical path method A scheduling algorithm that calculates the durations of all sequences of dependent tasks that must be completed before the job ends; finds the longest such sequence, which is called the critical path; and tries to arrange the other tasks so they can be completed in parallel, without making the critical path any longer. This determines the shortest possible time to complete the job.
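The length of the critical path can be sketched as a memoized recursion over the task dependency graph (names are illustrative):

```python
from functools import lru_cache

def critical_path_length(durations, deps):
    """Shortest possible completion time = length of the longest chain
    of dependent tasks.
    durations: task -> duration; deps: task -> list of prerequisite tasks."""
    @lru_cache(maxsize=None)
    def finish(task):
        # A task can start only once all its prerequisites have finished.
        start = max((finish(d) for d in deps.get(task, [])), default=0.0)
        return start + durations[task]
    return max(finish(t) for t in durations)
```

Tasks not on the critical path have slack and can be scheduled in parallel without lengthening the total completion time.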
critical section A segment of code involving shared resources that should not be accessed by more than one thread concurrently. A synchronization mechanism is required at the entry and exit of the critical section to ensure exclusive use.
crossbar switch A switch connecting multiple inputs to multiple outputs in any-to-any fashion.
CUDA A parallel computing architecture and API created by NVIDIA that enables NVIDIA GPUs to serve as a platform for general-purpose HPC applications. The name originally meant Compute Unified Device Architecture.
dangling pointer A pointer that no longer points to an address with a value of the type specified by the pointer.
data alignment Storing data at a memory offset that is equal to some integer multiple of the data's word length, in order to increase the system's performance. Data alignment may account for other special data dimensions in the system, such as the width of a vector register or the length of a cache line.
data dependency A situation that occurs when there are two or more references to the same memory location, and at least one of them is a write. Parallel programmers must be aware of data dependencies in their programs to ensure that modifications made to data by one processor are communicated to other processors that use the same data.
data locality The proximity of data to the core that is fetching them. For good performance, accessed data should stay near the top of the memory hierarchy (registers, caches) to be used and re-used as much as possible before being replaced. Spatial locality refers to the closeness of memory references in address space; ideally, the addresses are consecutive. Data move up the hierarchy in “cache lines” that include nearby data, so operations (especially vector operations) can step through contiguous data more quickly. Temporal locality refers to how close a memory reference is in time to a previous reference. Closer is again better, as the data are more likely to be re-used before leaving registers or cache, reducing data movement. Data locality or “degree of locality” is a concept that includes both temporal and spatial locality.
data parallel A programming model in which each parallel worker (or task) performs the same work on a unique segment of the data. Either message passing libraries such as MPI or higher-level constructs such as OpenMP can be used for coding with this model. An alternative to data parallelism is functional parallelism.
database A collection of data that is structured to facilitate the retrieval of related items using "queries".
deadlock A situation in which parallel tasks are waiting on an event which will never occur. The simplest version of deadlock arises when blocking writes and reads are incorrectly ordered in a loosely synchronous environment. In MPI, for example, when two tasks execute blocking writes to each other, deadlock will occur since neither write can complete until a complementary read is executed in the other task.
deep learning A set of techniques for machine learning that leverage the power of multilayer neural networks to learn about patterns in, and to make predictions about, data.
denormalization In relational databases, the process of deliberately breaking normalization rules, for example to improve performance or to simplify queries.
dense linear algebra Linear algebra operations involving vectors and matrices that contain mostly non-zero values. Computationally, such operations are amenable to performance optimization techniques such as cache tuning (blocking) and vectorization. Often, dense linear algebra is best handled through calls to a library that features optimized implementations of the standard routines of BLAS, LAPACK, and/or ScaLAPACK.
depth-first search A strategy for searching a graph that begins at a root node, selects a branch, and inspects as far as possible along that branch (and its sub-branches) before backtracking. Compare with "breadth-first search".
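A minimal recursive sketch of the strategy (names are illustrative); contrast the implicit call stack here with the explicit FIFO queue used by breadth-first search:

```python
def dfs_order(adj, root):
    """Return vertices in depth-first (preorder) order starting from root.
    adj maps each vertex to a list of its neighbors."""
    seen, order = set(), []
    def visit(v):
        seen.add(v)
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                visit(w)       # descend as far as possible before backtracking
    visit(root)
    return order
```

On the same graph, DFS follows one branch to its end before backtracking, whereas BFS visits all of a vertex's neighbors first.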
design pattern In software engineering, a high-level description of the software elements and interactions that are needed to meet a particular design goal, according to a systematized classification of such solutions.
deterministic execution A desirable trait of a message-passing program, meaning that the program will always return the same answer in roughly the same amount of time, regardless of the exact order in which messages are sent or received, or other types of uncertainties that might occur during execution on multiple nodes.
diagonal matrix Matrix whose nonzero entries are found only on the main diagonal.
diameter The greatest distance, measured in hops, between any two nodes in a given network topology.
die A single computer chip or other functional circuit that has been sliced out of a larger rectangular wafer of semiconducting material. In HPC, "die" commonly refers to a microprocessor.
directed acyclic graph (DAG) A directed graph in which there are no cycles, i.e., no way to start at some vertex v and follow a sequence of directed edges back to v.
directive An extension to a programming language specifying how the compiler or assembler should process a given group of statements. Often a directive requires a preprocessor to pass along special instructions to the compiler whenever a certain sentinel appears, e.g., "#pragma omp" for OpenMP extensions.
DirectX A collection of APIs for handling multimedia-related tasks on Microsoft platforms.
discrete event pattern A generalization of pipeline parallelism in which the dependency graph may contain branches and loops that connect the various tasks.
discrete event simulation A model of a system in which the system operates as a sequence of separate events in time, between which no changes occur. The input events do not have to be causally related, e.g., they could be simulated requests coming from customers in different service queues. Statistics from the simulation may help diagnose problems with the service.
discretization The process of converting continuous mathematical models and equations into discrete approximations, suitable for representation on digital computers.
distributed memory Memory that is split up into separate pools, each of which may be directly accessed by only one node of a parallel computer. Distributed memory and shared memory are two major classes of parallel computers that require very different programming styles.
distributed processing Processing on a number of networked computers, each of which has local memory. The computers may differ in relative power and function.
DMA Direct memory access. The ability of hardware subsystems within a computer to access main memory independently of a central processing unit (CPU).
DOI A unique alphanumeric string assigned by a registration agency to identify content and provide a persistent link to its location on the internet.
domain decomposition Separating a sequential problem into concurrent components by dividing its data structure into contiguous substructures, in a manner analogous to dividing a geometric region into subregions. A boundary value problem, for example, may be parallelized by (1) splitting it into smaller boundary value problems on subdomains and (2) having each process or thread coordinate its solution for a subdomain with the adjacent subdomains. This procedure may be iterated if necessary.
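As a sketch of step (1), here is a simple 1-D decomposition of n data points into contiguous subdomains, with any remainder spread over the first workers (the function name is illustrative):

```python
def decompose_1d(n, nworkers):
    """Split indices 0..n-1 into contiguous, nearly equal subdomains,
    returning one (start, end) half-open interval per worker."""
    base, rem = divmod(n, nworkers)
    bounds = []
    start = 0
    for w in range(nworkers):
        size = base + (1 if w < rem else 0)  # first `rem` workers get one extra
        bounds.append((start, start + size))
        start += size
    return bounds
```

Step (2), the coordination at subdomain interfaces, is then typically implemented by exchanging "ghost" or "halo" boundary values between workers that own adjacent subdomains.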
double data rate (DDR) Data communication that occurs on both the rising and falling edges of the clock signal, e.g., DDR memory. Also called double-pumped.
DRAM Dynamic Random Access Memory, the type of RAM that is used in the main memory of computers.
dusty-deck Also called "legacy code"; programs written in older paradigms not generally used in more recent software.
dynamic load balancing Attempting to keep all parallel workers busy in real time, while an application is running. One possible strategy is work stealing, where an idle worker will try to take a task from the task queue of a busy neighbor.
dynamic programming In mathematics, a method for constructing an optimized solution to a complex problem by breaking it down into simpler subproblems.
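A minimal illustration using the Fibonacci numbers, whose naive recursion recomputes the same subproblems exponentially many times; the dynamic-programming version solves each subproblem exactly once:

```python
def fib_dp(n):
    """Bottom-up dynamic programming for Fibonacci: each of the n
    subproblems is solved once, giving O(n) time instead of the
    exponential cost of the naive recursion."""
    if n < 2:
        return n
    prev, cur = 0, 1
    for _ in range(n - 1):
        prev, cur = cur, prev + cur
    return cur
```

The same pattern, tabulating solutions to overlapping subproblems, underlies classic algorithms such as shortest paths and sequence alignment.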
eager limit The message size up to which an MPI sender will transfer the data to a receive-side buffer without waiting for the receiver's acknowledgement. The message is sent using the eager protocol, which means no handshaking is required prior to the communication.
EEMBC EDN (magazine's) Embedded Microprocessor Benchmark Consortium. A group that develops benchmark suites, such as CoreMark and MultiBench, that help designers select the best microprocessor for their embedded systems. The benchmarks were intended to address the shortcomings of Dhrystone. Pronounced "embassy".
effective bandwidth The actual rate at which data are transmitted in practice, as opposed to the peak or theoretical bandwidth.
efficiency A measure of the amount of time a parallel program spends doing useful computation as opposed to other work. Efficiency is sometimes defined as speedup / number of processors. The closer it is to 1, the more perfectly parallel the task is at that level of parallelism; the closer to 0, the less parallel.
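With efficiency defined as speedup divided by processor count, a small sketch (the function name is illustrative):

```python
def efficiency(seq_time, par_time, nprocs):
    """Parallel efficiency = speedup / number of processors,
    where speedup = sequential time / parallel time."""
    speedup = seq_time / par_time
    return speedup / nprocs

# Example: a 100 s job that runs in 50 s on 4 processors
# has speedup 2 and efficiency 2/4 = 0.5.
e = efficiency(100.0, 50.0, 4)
```

An efficiency near 1 means the processors are nearly always doing useful work; values well below 1 indicate time lost to communication, synchronization, or load imbalance.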
embarrassing parallelism A characteristic of a computing problem that allows it to be easily separated into many independent tasks. An "embarrassingly parallel" problem tends to require little or no communication between tasks so that only minimal effort is needed to parallelize it.
emulator Similar to a hypervisor, except that in addition to emulating various hardware peripherals, an emulator also translates the virtual machine's CPU instructions to the host architecture. Emulators are even more frequently implemented in software rather than hardware or firmware.
energy In physics, a definite quantity of work. Energy is equal to power integrated over time, so it is often measured in watt-hours (W-h) or kilojoules (3.6 kJ = 1 W-h). A battery can supply a fixed amount of energy to a computer chip operating at a reference voltage.
Entity Relationship Diagram (ERD) A diagram which models a data set, showing entities, relationships between entities and the attributes of both entities and relationships. Entity Relationship Diagrams are very helpful aids to designing relational databases when the datasets are even moderately complex.
environment variables Names and assigned values that form part of the operating environment in which a shell or a process runs. Environment variables are inherited by child processes unless otherwise specified.
event-driven architecture A software architectural pattern in which agents send out signals to notify other agents about events, or significant changes in state. An event will usually trigger responses by other agents. Agents may be emitters of events, or consumers (sinks) of events, or both. An event manager may also be designated to handle the signal traffic through the communication medium.
exascale computing Computing on systems capable of at least one exaflops, i.e., a quintillion (10^18) floating point operations per second.
explicit parallelism A characteristic of a programming language or API in which the programmer provides special instructions regarding synchronization, communication, or task partitioning in order to achieve parallelism.
Fast Fourier Transform A numerical method for computing the discrete Fourier transform (DFT) or its inverse. The best-known method is the Cooley-Tukey algorithm, which computes the FFT of N evenly-spaced points in O(N log(N)) operations.
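A compact sketch of the recursive radix-2 Cooley-Tukey algorithm (suitable only when the input length is a power of 2; production codes such as FFTW are far more general and heavily optimized):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT of a sequence whose length
    is a power of 2. Splitting into even and odd samples halves the
    problem at each level, giving O(N log N) total work."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    # Combine the half-size transforms using the twiddle factors e^(-2*pi*i*k/n).
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddle[k] for k in range(n // 2)] +
            [even[k] - twiddle[k] for k in range(n // 2)])
```

Each of the log2(N) recursion levels does O(N) work combining half-size transforms, which is the source of the O(N log N) operation count.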
fast multipole method (FMM) A mathematical technique that was developed to speed up the calculation of far-field forces in particle systems. FMM was first devised by Greengard and Rokhlin and later named one of the top ten algorithms of the 20th century. When FMM is applied in a hierarchical (tree-like) manner, it can reduce the cost of computing the matrix-vector products in an iterative solver from O(N^2) to O(N).
fat tree A network topology that can be represented by a tree structure. Like the branches of a real tree, the links of a fat tree become "fatter" as one moves along the tree toward the trunk; the increasing "fatness" indicates the higher bandwidth of the inner links. For a binary tree, the bandwidth ideally increases by a factor of 2 at each level, as one proceeds from the leaves to the trunk.
fault-tolerant system A computing system that continues to operate correctly even if one (or more) of its components happens to fail.
field reordering Changing the order of fields in a structure (e.g., a C struct) to improve cache behavior. Commonly, the fields in a structure will be separated into hot and cold groups, so that the frequently-used or hot fields are more likely to be brought into cache together, in the same cache line.
file system A set of operating system conventions for managing the storage of files on disk. Modern OSes can make use of multiple file systems for different purposes. Storage that is directly attached to a computer can be formatted with a file system like ext4, the most common Linux file system. (ext4 is the fourth version of the original extension of the MINIX file system.) A file system like NFS (Network File System) can use network protocols to provide access to storage attached to a remote computer.
fine-grained locking Locking only the specific data required by a transaction, rather than some larger dataset of which the data are a part. Fine-grained locking increases opportunities for parallelism at the expense of doing more work, since there are potentially many more locks to set, check, and free.
flat-MPI A programming approach in which all parallelism is expressed via MPI, and each MPI process is assigned to a separate CPU core in the system, with no regard for architectural features such as shared vs. distributed memory, fast vs. slow link speed, etc.
flip-flop A circuit element that can be used to store one bit of state information. Since it has two stable states (1 and 0, true and false), it is a suitable element for storage and logic circuitry.
flood bandwidth The effective bandwidth of a data link when it is saturated with relatively large messages.
flops FLoating-point Operations Per Second. Used to measure a computer's performance. The flop/s notation is often preferred because it removes ambiguity about the meaning of the "s" (it's not merely a plural). It can be combined with common prefixes such as M=mega, G=giga, T=tera, and P=peta.
foo A common placeholder name or metasyntactic variable often used in computer programming or documentation. If a second placeholder name is needed, the customary choice is "bar".
foreign function interface (FFI) A mechanism or API by which one programming language may call functions or routines or access variables in another programming language.
fork The action of having one thread launch multiple new threads, increasing the number of parallel workers available to the code.
FPGA Field-Programmable Gate Array, an integrated circuit that is designed to be configured by the customer after manufacturing ("in the field"). The typical FPGA includes programmable logic blocks, RAM blocks, and reconfigurable interconnects that allow the blocks to be "wired together" to meet the intended purpose.
framework A programming abstraction that allows a user to build application software by making selections from a set of generic software components.
FTP File Transfer Protocol. A client-server protocol for transferring files over TCP, with a variety of clients. Uses two channels (ports), one for control and one for data.
functional parallel A programming model in which a program is broken down by functions, and parallelism is achieved by assigning each function to a different worker (task). Message passing libraries such as MPI are commonly used for communication among the processors. An alternative to functional parallelism is data parallelism.
functional programming A programming paradigm in which functions are treated as mathematical functions, in that they do not have side effects (or have as few side effects as is reasonable) and do not change an internal state after being called. Additionally, functions can be treated as variables that may be operated on by other functions, and like variables, also have a well-defined type, largely based on the arguments and outputs of the function.
functional unit Any element in the CPU that performs operations and calculations. It may have its own control unit, registers, and computational units.
fused multiply-add (FMA) A floating-point combined multiply–add operation performed by a single instruction, with a single rounding at the end. Typically, one such operation is performed each cycle through pipelining.
game design The creative process for developing a computer game, including facets such as rules, gameplay, and storyline.
garbage collection (GC) A form of automatic memory management (users of the language or library needn't call functions like malloc or free to allocate and free memory) whereby the garbage collector keeps track of object references; when there are no references to a given object, it decides when to free the object (i.e., delete its memory contents and relinquish the memory to the operating system).
GASNet A low-level networking layer for PGAS languages, providing network-independent, high-performance communication primitives.
gateway A single, unified point of entry into a community-developed suite of networked data, tools, and applications. The components are usually integrated into a customized graphical user interface, such as a Web portal or a suite of applications, which has been tailored to meet the needs of the specific community (e.g., a Science Gateway).
Gaussian elimination An algorithm for solving a system of linear equations. It is essentially equivalent to LU decomposition (LU factorization) of the matrix of coefficients.
Gaussian Mixture Model (GMM) A probabilistic model that assumes all the data points are generated from a weighted, finite sum of Gaussian distributions with unknown parameters (means and variances).
ghost cells In domain decomposition, a layer of extra computational cells that surrounds each computational subdomain, usually only 1 or 2 cells thick. The ghost cells are meant to hold border values that are computed in neighboring subdomains and received via message passing. Also called shadow cells.
global address space In parallel programming, a way of mapping memory across machines that hides the distinction between shared memory and distributed memory. If remote data are required by a thread, the data are fetched (or prefetched) from distributed memory via mechanisms invisible to the thread.
global memory Main memory that is directly addressable by all processors or CPUs. It can be either shared memory or distributed shared memory (Partitioned Global Address Space, PGAS).
Globus A research data management system that provides fast, reliable and secure file transfers and sharing.
Globus Connect Personal Software that creates a Globus endpoint on your laptop or other personal computer, allowing you to transfer and share files. Globus Connect Personal is available for Mac OS X, Windows, and Linux operating systems.
Globus endpoint A file transfer location used as the source or destination for a Globus file transfer. Once a resource (server, cluster, storage system, laptop, or other system) is defined as an endpoint, it will be available to authorized users who can transfer files to or from this endpoint.
Globus ID An identity available from Globus for logging into Globus. If you don’t want to use your existing organizational login (e.g. university, national lab, Google) with Globus, you can create a Globus ID and use that to log into Globus. While Globus ID is not needed to use Globus capabilities via the web and API, Globus ID is required to use the Globus CLI (Command Line Interface).
Globus task ID A unique identifier associated with every asynchronous Globus transfer or delete operation. This ID can be used to inquire about the task's status and to cancel the task.
GPGPU General-Purpose computing on Graphics Processing Units. The utilization of a GPU for performing numerically intensive computations that would ordinarily be done on one or more central processing units (CPUs).
GPU The Graphics Processing Unit of a computer. It is designed to assist the CPU by taking a stream of graphics input and quickly converting it into the series of images that one sees on a computer display. More recently, GPUs have also become accelerators for general floating-point computations, including HPC applications.
granularity The relative number and size of independent tasks that could be distributed to multiple CPUs. Fine granularity is illustrated by execution of sets of inner-loop iterations as separate tasks; coarse granularity might involve one or more subroutines as separate tasks. The finer the granularity, the more overhead is required to keep track of the tasks. Granularity can also describe the relative duration of tasks. Granularity in number, size, and time profoundly affects the overhead and speedup of a parallel program.
graph algorithm An algorithm that takes one or more graphs as inputs and solves a problem in graph theory. Graphs are mathematical structures that model pairwise relations between objects.
graph partitioning The problem of subdividing a graph G into smaller components based on specific criteria. For example, a uniform graph partition problem seeks to form components of G that are close to the same size and have few connections between the components.
graph traversal Visiting all the nodes of a given graph, in order to update or check data values that are associated with the nodes of the graph.
graphical model A representation of the conditional dependencies between random variables; used in artificial intelligence (AI) programming.
Gray code addressing Constructing addresses for the nodes in a hypercube network topology by using a binary numerical system (Gray code), in which two successive integer values are represented with only one bit difference. For a d-dimensional hypercube, each node is connected to d other nodes whose addresses are 1 bit different.
greedy algorithm A strategy where one makes a locally optimal choice at each stage with the hope of finding the global optimum. While this strategy is not guaranteed to produce the global optimum, it may approximate an optimal solution in a reasonable time.
grid (mesh) A collection of sample points, and their logical links to neighboring points, used to construct a finite representation of a continuum object. A "structured" grid or mesh has regularity in the selection of sample points; otherwise, the grid is said to be "unstructured".
grid computing The use of aggregated, heterogeneous computer resources from multiple locations to achieve a common goal. The typical workload is non-interactive (or very loosely coupled) and file-based. A grid can either be devoted to one particular application or used more flexibly for a variety of purposes.
GUI Graphical User Interface. A user interface built on graphical elements rather than text (although it may include text). Examples include the Gnome Shell and the default user interfaces of Microsoft Windows and Apple's OSX.
Gustafson's Law A reformulation of Amdahl's Law in which the per-process parallel workload becomes the fundamental quantity, rather than the single-process sequential workload. Since the former can grow with the number of PEs, whereas the latter typically does not, scalability becomes a more reachable goal. The law is stated as S = P - a*(P-1), where S is the "scaled speedup", P is the number of PEs, and a is the sequential fraction of the parallel workload.
Hadoop An open-source software framework that couples the MapReduce algorithm to a distributed file system. It enables data-intensive distributed applications to run on large clusters of commodity hardware.
hardware latency In a computer network, the message delay imposed by the physical hardware alone; also called "wire time". In present-day systems, the hardware latency is usually negligible compared to the software latency for performing message-passing between sending and receiving nodes.
hash The value returned by a hash function, an algorithm that creates a small key of fixed length from the information contained in a larger data set.
HDF Hierarchical Data Format, a set of self-describing file formats for storing and organizing large amounts of scientific data. HDF (or HDF5) can also refer to the APIs used to perform I/O on such files; the main library that implements the HDF APIs; and various utilities for manipulating, viewing, and analyzing data in HDF files. All are maintained by the nonprofit HDF Group.
heap An area of main memory used for dynamic memory allocation. Memory requests like malloc() in C or allocate() in Fortran are fulfilled by granting them unclaimed or "free" portions of heap memory.
heterogeneous computer Computer that utilizes multiple types of processors in the system, for example, both CPUs and GPUs. When programming such a system, one strives to take advantage of the strengths of the different processor types, e.g., the low communication latency of the CPUs and the high throughput of the GPUs.
Hidden Markov Model (HMM) A type of statistical model in which the system being modeled is assumed to be a Markov (history-independent) process whose states are hidden (not observable). It differs from ordinary Markov models in that just some state-dependent output is visible, not the states themselves.
High Throughput Computing (HTC) The use of many, often heterogeneous, computers to execute work on loosely coupled tasks. Unlike HPC, the HTC model of computation is generally intended to get reliability out of large, distributed infrastructure, in order to provide answers over time, rather than in the shortest possible time.
home The directory you are in when you first log in.
hop A segment of the communication path between source and destination. Data packets often have to travel through multiple routers or switches before reaching their final destination. A hop occurs each time a packet is passed along to the next router or switch.
hostfile A file containing a list of IP addresses that is used to map MPI ranks onto hosts. The IP address of a particular host can (and typically does) appear more than once. Each MPI rank is assigned to the host named on the corresponding line in the hostfile. If there are more ranks than entries in the list, then assignment cycles back to the start of the list, as many times as necessary.
hot/cold splitting Taking a data structure such as a C struct and splitting it in two to improve cache behavior. Generally the first struct contains just the frequently-used or hot fields, plus a pointer to the second struct, which stores the lesser-used or cold fields. Similarly, the cold lines in a subprogram can be split into a separate routine, making these lines less likely to clutter up the instruction cache.
HPC High Performance Computing (HPC) refers to the use of a computer capable of concurrent computation to obtain results in the shortest possible time. Users generally turn to HPC when facing extraordinary requirements, e.g., a need for excessive amounts of memory or much shorter wait times than are available on typical computers. HPC therefore involves paying attention to many factors that influence performance, including the speed of computational units, the size of the available memory, and communication capabilities among all components of the computer.
HPCC HPC Challenge benchmark, a suite of tests that includes HPL, STREAM, and DGEMM, plus four others.
HPL High Performance Linpack, the version of the Linpack benchmarks that is used to rank the HPC systems in the Top 500 list twice per year.
hybrid programming A style of programming which combines distributed-memory and shared-memory techniques. In this style, the number of tasks per node is typically smaller than the number of cores per node. To exploit all the available cores, message passing (e.g., MPI) is used for communication between tasks, and multithreading (e.g., OpenMP) is used within each task.
hyperthreading Intel's term for simultaneous multithreading (SMT), in which the number of apparent cores exceeds the number of physical cores. This enhances thread-level parallelism (TLP) by better exploiting the multiple functional units that are available to each physical core, resulting in faster parallel execution of computational threads.
hypervisor Computer software, or more rarely nowadays, firmware or hardware, that can create and run virtual machines (VMs).
i/o Input/Output, often referring to moving data in and out of a computer's CPU, memory, or storage.
image A template of a computer's root drive, containing the operating system and possibly software, data and configurations.
implicit parallelism A characteristic of a programming language or API in which parallelism is already expressed through the language's built-in constructs, so the compiler or interpreter can automatically exploit the parallelism in the code. Such a language does not require special directives or functions to enable parallel execution.
inference engine A tool from artificial intelligence that applies logical (if-then) rules to a preexisting knowledge base and deduces new knowledge.
InfiniBand A high speed communication link used in HPC. It features high throughput and low latency.
instance A running copy of a bootable image that has been launched in a cloud computing environment.
instruction-level parallelism The potential for a computer program to execute its low-level operations in parallel. ILP is enabled when a microprocessor has multiple functional units that can operate concurrently. It is assisted through software (the compiler).
interactive session A means of interacting with a worker node directly in a batch-scheduled system, allowing for the development and testing of scripts which will be executed automatically when they are ready. Different systems provide different means of requesting an interactive session, depending on the scheduler.
interoperability The capability of a software application to make use of different kinds of computers, OS's, software components, and interconnecting networks.
interpreted language A programming language in which programs are "indirectly" executed by an interpreter program. This can be contrasted with a compiled language, which is converted into machine code and then "directly" executed by the host CPU. An interpreted language does not require compilation and is typically executed line by line.
intrinsic function (Fortran) A function that is built into the Fortran language specification and is expected to be recognized by the compiler.
iterative method A computational procedure that generates a sequence of approximate solutions to a problem, until some error tolerance, convergence criterion, or other exit condition is met. For example, GMRES is an iterative method for solving a system of linear equations. In contrast, a direct method solves the problem through a prescribed, fixed set of operations (e.g., Gaussian elimination). Parallelization may be possible in the construction of each approximate solution (or single direct solution).
kernel A key element of operating systems; or, a key section within an application program. In operating systems, the kernel is the computer program that sits between and controls the interaction of the software and the computer's resources. The kernel is responsible for scheduling processes on the available hardware, for example. In an application program, a kernel is a compact section of code that expresses some well-defined, repetitive task, making it a good target for performance optimization, or perhaps for GPGPU recoding.
kernel parallelism A paradigm of computer programming in which an entire array of similarly-structured data objects, the stream, is acted on concurrently by a specially defined function, the kernel. This is the central paradigm behind NVIDIA CUDA and OpenCL. It is also called stream processing.
keyword A word or identifier that has a particular meaning to the programming language; common examples are control flow words: for, while, if, else.
KNC Knights Corner, the code name for Intel's first generation Xeon Phi product based on its MIC (Many Integrated Core) architecture. KNC is a PCI-attached coprocessor which provides a large number of cores for parallel computations. It can run processes natively, or threads can be offloaded to it from a code running on the main processor.
KNL Knights Landing, the code name for Intel's second generation Xeon Phi product based on its MIC (Many Integrated Core) architecture. KNL provides a large number of cores in a stand-alone, self-booting processor. In addition to standard socket-attached memory, KNL includes on-package MCDRAM memory, which can act as RAM, L3 cache memory, or a mix of the two.
LAPACK Linear Algebra PACKage, a software library for numerical linear algebra. It has routines for the solution of linear systems, linear least squares, eigenvalue problems, singular value decomposition (SVD), and various matrix factorizations.
Larrabee A prototype for Intel's many-core MIC architecture and their Xeon Phi product line.
latency The overhead time to initiate a data transfer. In the case of message-passing, it is the time to send a zero-length message from one node or core of a parallel computer to another. Latency is therefore the part of the time to complete a data transfer that is independent of the size of the message.
layered systems pattern In software architecture, a structural pattern in which the overall system is configured into layers, such that each layer only interacts in a limited way with adjacent layers using a prescribed protocol such as function calls (Shaw and Garlan, 1996). The implementation of each layer then becomes much simpler and cleaner, and each layer may be tested independently and becomes much more portable.
lexically scoped synchronization Synchronization or coordination of multiple threads conferred through a language-specific property of the program text itself; e.g., synchronized methods in Java.
linear programming A special case of mathematical optimization in which the function to be optimized for minimum cost or maximum benefit is linear in the model variables.
linear speedup The ideal or perfect parallel speedup that results when a code runs N times faster on N CPUs (or cores). If the speedup S(N) is plotted against N, where S(N) = (sequential time)/(N-way parallel time), the graph is a straight line with slope 1.
linked list A data structure in which a sequence is represented by a group of nodes. Each node consists of a datum plus a pointer or link to the next node in the sequence. This data structure allows for the insertion or removal of nodes at any point in the list, without disturbing the rest of the list.
literal A literal is a constant value of a type, unlike a variable which can change. Examples are 42, or "answer" (a string literal).
livelock A situation in which parallel tasks are too busy responding to each other to resume working. They are not blocked, as in deadlock; their mutual actions simply do not lead to progress. The phenomenon is analogous to two people in a hallway who can't pass each other because both first dodge one way, then both dodge the other way, etc.
load balance A goal for algorithms running on parallel computers, which is achieved if all the workers perform approximately equal amounts of work, so that no worker is idle for a significant amount of time.
lock A way to enforce mutual exclusion (mutex) between threads that require synchronized access to the same data location or device. A code section that uses locks for this purpose is called a critical section. In OpenMP, the "critical" clause implicitly creates such locks around the specified code section.
logic gate In electronics, a circuit element that implements a Boolean function (AND, OR, etc.). Logic gates are primarily built with diodes or transistors acting as electronic switches.
login node Node in a cluster generally used for interactive work, e.g., for submitting jobs to batch processing systems, tracking running jobs, editing and managing files, and compiling and debugging code. Generally, one is advised to refrain from running significant computations on login nodes.
LogP machine Model for communication within a parallel computer based on 4 parameters: Latency, Overhead, communication Gap, and number of Processors.
loop unrolling A loop transformation technique that attempts to optimize a program's execution speed at the expense of its size. The transformation can be undertaken manually by the programmer or by an optimizing compiler. In this technique, loops are re-written as a repeated sequence of similar, independent statements. One of the advantages gained is that the unrolled statements can potentially be executed in parallel.
machine learning The implementation and study of computer systems that can "learn" from the data that they collect, e.g., recognize patterns or categories in the data after undergoing a training period. It is a type of artificial intelligence.
Manager Worker A programming approach in which one process, designated the "manager," assigns tasks to other processes known as "workers."
many-core processor A type of multi-core processor featuring tens or hundreds of cores, typically having smaller cache sizes and reduced capabilities compared to multi-core, but wider vector units. The design goal of many-core is to maximize parallel throughput.
MapReduce A programming model for processing large data sets in highly parallel fashion. It is named for its most famous implementation: the one that Google uses to create, sort, and store key-value pairs for its search engine.
mask An array of booleans that functions like a set of on/off switches to determine whether, or how, an element-wise operation is to be applied to other arrays with the same shape. In vector processing, a masked operation may be used to vectorize a simple if-else construct at the hardware level.
mass storage An external storage device capable of storing large amounts of data. Online, directly accessible disk space is limited on most systems; mass storage systems provide additional space that is slower and more difficult to access, but can be virtually unlimited in size.
matmul Matrix-matrix multiplication. A dense linear algebra operation that is a good target for optimization due to its potential for efficient vectorization and the reuse of cached data.
matrix A two-dimensional array of numbers or variables, arranged into rows and columns. A matrix is termed sparse if nearly all of its entries are zero; otherwise it is called dense.
MCDRAM Multi-Channel Dynamic RAM. Special high-bandwidth memory that is built into the same package as the processor, as found in the Intel Xeon Phi "Knights Landing" product.
memkind Extensible heap manager developed by Intel for adding memory management policies to source code in order to take advantage of features such as NUMA and page sizes. Its API allows code-based memory optimization for Xeon Phi architectures such as KNL.
memory hierarchy The layers of memory that can be accessed by the processor, arranged in order from fastest/smallest to slowest/largest. Typically, registers are at the top of the hierarchy, followed by the L1, L2, and L3 caches, then RAM, then the hard drive.
memory leak A situation that arises when memory is unintentionally reserved by a programmer when it should have been freed.
memory-bound A situation in which the speed for completing a given computational problem is limited by the rate of movement of data in and out of memory.
memory-level parallelism The ability of a computer system to process multiple memory operations (e.g., cache misses) simultaneously. MLP may be considered as a form of ILP or instruction-level parallelism.
message passing A communication paradigm in which processes communicate by exchanging messages via communication channels.
metascheduling The high-level direction of a computational workload through a metasystem composed of multiple systems, each with its own job scheduler and resources. A given job is queued for execution on these federated resources based on current and static characteristics such as workload, software, and memory. Also known as "automatic resource selection" or "global queue".
MIC Intel's "Many Integrated Core" architecture, an extension of the x86 architecture designed for 10s to 100s of cores that have greater SIMD capability per core. Because MIC is x86-compatible, it can make use of typical compilers for standard programming languages, as well as common interfaces and software tools for parallelizing codes. Xeon Phi is Intel's brand name for its MIC products.
microbenchmark In HPC, an attempt to measure the performance of some specific task that is performed by the processor. The code that is used in such a test usually performs no I/O, or else it tests a single, specific I/O task.
MIMD Multiple Instruction, Multiple Data. A type of parallel computing in which distinct processing units can execute different instructions on different pieces of data simultaneously. Examples of MIMD architectures are: distributed computer systems, superscalar processors, and multicore processors.
minimum spanning tree Given a connected, undirected, weighted graph, the minimum spanning tree of the graph is a subgraph that has the least total weight compared to all the other spanning trees, i.e., subgraphs that have the form of a tree and connect all the vertices together.
MKL Intel's Math Kernel library, a software implementation of BLAS, LAPACK, and other commonly-used mathematical and numerical functions. MKL routines are optimized for Intel microprocessor architectures.
MMX A SIMD instruction set for Intel CPUs that allows programs to process two 32-bit integers, four 16-bit integers, or eight 8-bit integers concurrently by using special 64-bit-wide integer units.
model-view-controller A software architectural pattern for implementing user interfaces. The model component embodies the application's data, logic and rules; data from the model are provided to the view component, which presents output to the user; and based on the view, the user gives input to the controller component, which sends the user's input to the model and/or view.
modularity A characteristic of software that has been separated into independent, self-contained units, each of which fulfills a single, distinct role. The modules interact with one another only through well-defined interfaces.
module A command that sets up a basic environment appropriate to the specified compiler, tool, or library.
Monte Carlo method Any computational method which uses random sampling to compute an approximation, as part of its algorithm.
Moore's Law The observation that transistor density in integrated circuits doubles roughly every two years; or, the prediction that this historical trend will continue into the future. The law is named after Intel co-founder Gordon E. Moore, who described the trend in a 1965 paper.
MOSFET Metal-Oxide-Semiconductor Field-Effect Transistor, a type of transistor in which the metal gate is insulated from the source and drain by an oxide layer. Due to low power consumption and fast switching times, MOSFETs are commonly used as logic gates in microprocessors and in other kinds of digital circuitry.
MPI Message Passing Interface, a de facto standard for communication among the nodes running a parallel program on a distributed memory system. MPI is a library of routines that can be called from both Fortran and C programs. MPI's advantage over older message passing libraries is that it is both portable (because MPI has been implemented for almost every distributed memory architecture) and fast (because each implementation is optimized for the hardware it runs on).
MPI communicator A data structure that can be associated with a group of MPI processes in order to hold additional attributes about the group's identity, use, means of creation and destruction, and the scope of its communication universe.
MPI group An ordered set of MPI processes. Each process in a group is associated with a unique integer rank. Rank values start at zero and go to N-1, where N is the number of processes in the group.
MPP Massively Parallel Processor. A computer system that utilizes a large number of processors to perform parallel computations. An MPP is similar to a very large cluster, but it features a specialized interconnect network. Each CPU possesses its own memory and runs its own OS.
multi-core processor A computer chip that holds two or more independent central processing units (called "cores") that read and execute program instructions.
multigrid method A method for solving elliptic PDEs by starting with a solution on a coarse grid and making successive refinements to the solution on a series of finer grids. This ultimately produces a better approximation on the finest grid. The process is then iterated until the solution has converged to some tolerance based on a residual vector.
multiplexer (MUX) An electronic device that selects one of several input signals and forwards the selected input to a single output.
multiprocessing The execution of multiple concurrent software processes in a computer system, especially a system with two or more hardware CPUs.
multithreading The hardware-based capability of executing more than one thread per core. From the hardware perspective, multithreading is distinguished from "shared memory multiprocessing", which refers to only 1 thread/core. But from the software perspective, both situations can be termed multithreading.
mutex Mutual exclusion, which means ensuring that no two threads can concurrently execute a set of instructions that must be run by one thread at a time (critical section). A lock or other mutex mechanism allows only a single thread to enter the critical section, forcing other threads to wait until the thread has exited from that section.
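A minimal Python sketch of a mutex protecting a critical section (illustrative only; `threading.Lock` plays the role of the mutex):

```python
import threading

counter = 0
lock = threading.Lock()  # the mutex

def increment(n):
    global counter
    for _ in range(n):
        with lock:       # only one thread at a time in this critical section
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- deterministic because the mutex serializes updates
```

Without the lock, concurrent updates to `counter` could interleave and lose increments.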
NAS Parallel Benchmarks A small set of programs derived from computational fluid dynamics (CFD) applications at NASA. The benchmarks are intended to gauge the performance of different large-scale parallel computers.
Navier-Stokes equations The fundamental mathematical model of viscous fluid dynamics.
NetCDF Network Common Data Form, a set of self-describing, platform-independent data formats designed for array-oriented scientific data. NetCDF can also refer to the software libraries that support reading, writing, and sharing data in this form. NetCDF 4 recognizes HDF5 as an underlying data format.
network topology Conceptual arrangement of the various elements (links, nodes, etc.) of a computer network, expressing the connectivity properties of the network.
NIC Network Interface Controller. A computer hardware device connecting a computer to a network.
node One of the individual computers linked together by a network to form a parallel system. User access to a node is mediated by an operating system which runs on that node. A node may host several operating systems as virtual machines. One node may house several processors, each of which may, in turn, have multiple cores. All cores on a node have access to all of the memory on the node, but the access may be nonuniform. The term derives from a "node" on the connectivity graph that represents the computer network.
node (graph) An object in a graph that is connected to one or more other objects through links. Nodes are also referred to as vertices, and links are also called edges.
normalization In relational databases, a method for organizing the database into tables in such a way as to minimize the occurrence of anomalies or data inconsistencies upon inserting, updating, or deleting data.
NUMA Non-Uniform Memory Access. In many computers, memory is divided among subgroups of cores such that a given subgroup has faster access to "local" data, due to traveling through fewer controllers and/or across wider buses. The performance of a program can often be improved by assigning memory and core usage to ensure that data are stored in the memory locations that are most accessible from the cores that will use the data. Uniform Memory Access (UMA) is the opposite: there is a single memory controller with a unified bus to all of the memory, so that no core is favored in accessing any particular memory location.
numactl Linux tool for managing memory allocations and pinning processes to specific CPUs, which allows processes to steer their memory allocations to different physical locations in order to improve performance at runtime. Pronounced "NUMA control".
octree A tree data structure or branching graph in which each internal (non-leaf) node has exactly eight children.
Omni-Path Intel's high-performance communication architecture, offering low latency, low power consumption, and high throughput. Initial (2016) speeds were 100 Gbit/s/port, similar to 4X EDR InfiniBand.
one-sided communication Message passing that can be initiated by one process or thread acting alone, if RDMA is available. Instead of needing a matching receive or send call, a sender can simply "put" a message, or a receiver can simply "get" a message.
online algorithm An algorithm that is able to work with input as it arrives, without the complete input being available from the start. In contrast, an offline algorithm requires all the input data to be present from the beginning in order to produce a solution.
OpenACC A directives-based, portable programming model intended to simplify parallel programming of heterogeneous systems composed of CPUs and GPU accelerators. Like OpenMP 4.0+, OpenACC directives and functions are able to target both CPUs and GPUs for parallel execution of code.
OpenCL A programming framework for writing codes that can execute across a platform consisting of CPUs, GPUs, and other processor types. The software was originally developed by Apple Inc.
OpenGL Open Graphics Library. An API for interacting with the graphics processing unit (GPU), allowing programmers to optimize the 3D rendering performance through hardware acceleration.
OpenMP A set of software constructs that are expressed as directives to a compiler that cause sections of code (typically, iterations of loops) to be run in parallel on separate threads. This gives a programmer the advantages of a multithreaded application without having to deal with explicitly managing the creation and destruction of threads.
OpenSHMEM An API for parallel programming which creates a virtual shared-memory space in a distributed-memory system. SHMEM stands for Symmetric Hierarchical MEMory: the "symmetric" segment is shared, while other memory is private. The "shared" memory can be accessed via one-sided communication operations from remote PEs. OpenSHMEM attempts to standardize and subsume several previous SHMEM APIs from SGI, Cray, and Quadrics. It can provide a low-level implementation layer for PGAS (Partitioned Global Address Space) languages like UPC (Unified Parallel C).
operating system A collection of system software - such as a kernel, common software libraries, and utilities - that manages hardware and software resources on a computer to allow applications to run and users to interact. Hardware resources managed by the operating system include central processing units (CPUs), graphics processing units (GPUs), random-access memory (RAM), input/output (I/O) devices, and network connections. Software resources can include security (such as identity management), process scheduling, and a user interface (UI) or shell.
optical flow Computational methodologies from image processing and navigation control that are needed for robotics, including motion detection, object segmentation, and the computation of pixel displacements between consecutive images.
optimization The act of tuning a program to achieve the fastest possible performance, or consume the least resources, on the system where it is running. There are tools available to help with this process, including optimization flags on compiler commands and optimizing preprocessors. You can also optimize a program by hand, using profiling tools to identify "hot spots" where the program spends most of its execution time. Optimization requires repeated test runs and comparison of timings, so it is usually worthwhile only for production programs that will be rerun many times.
Our Pattern Language (OPL) "A design pattern language for engineering (parallel) software." This description is also the exact title of OPL's defining document, co-authored by Kurt Keutzer (EECS UC Berkeley) and Tim Mattson (Intel). OPL is one outcome of an ongoing project centered at UC Berkeley's Par Lab.
out-of-core algorithm An algorithm designed to process data that are too large to fit into a computer's main (or "core") memory all at once. Such an algorithm must be very efficient at fetching data from, and storing data to, the slow bulk memory, e.g., a hard drive.
out-of-order execution A way to avoid processor stalls by keeping multiple instructions in a queue and permitting the execution of any of them as soon as their operands have been loaded from memory. To preserve program semantics, the results from older operations must be stored ahead of the results of newer operations.
overhead Any hidden, extra utilization of CPU time, memory, bandwidth, or other resource that is required to achieve a particular computational goal. Examples from parallel computing are: communication overhead that is required to initiate message passing in MPI, or OS overhead that is required to fork a thread team in OpenMP.
overlap Computing and communicating simultaneously; or, communicating while other communications are in progress (a.k.a. pipelining).
overload A general mechanism employed by some programming languages that will allow different functions or operators to be called using the same symbol (or name); the underlying function called depends on the type of the argument(s) passed to the overloaded symbol.
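A hedged Python illustration using `functools.singledispatch`, which selects the function body based on the type of the first argument (the function names here are invented):

```python
from functools import singledispatch

@singledispatch
def describe(x):
    """Fallback implementation for unregistered types."""
    return "unknown"

@describe.register
def _(x: int):          # called when the argument is an int
    return "integer"

@describe.register
def _(x: list):         # called when the argument is a list
    return "list"

print(describe(3))       # integer
print(describe([1, 2]))  # list
print(describe(3.5))     # unknown
```

In statically typed languages such as C++ or Java, the compiler resolves the overload at compile time instead.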
packet overhead The additional cost to transmit data in the form of discrete packets that contain more than just raw data. Packet overhead is typically due to the extra information embedded in the packet header, which is required to be assembled prior to transmission and disassembled after being received.
padding Inserting meaningless data entries between the end of the last data structure and the start of the next in order to produce favorable byte-boundary alignment (e.g., padding each row of an array in C).
page A fixed-length, contiguous block of virtual memory. One page is the basic unit of memory that is communicated between main memory and disk storage. Its size is often 4 KB.
paging A memory management technique that permits the total virtual memory of a computer to exceed its real physical memory by storing pages (typically 4 KB blocks) of virtual memory on disk. Paging can also refer to the process of swapping pages between disk and RAM, in order to allow memory that was "paged out" to be accessed again by the processor.
parallel file system A network file system that utilizes multiple connections to file servers in parallel to deliver higher read and/or write speeds for workflows that require significant amounts of fast I/O. Commonly-used parallel file systems include IBM's GPFS and the open-source Lustre file system.
parallel loops Loops in which the dependencies have been removed and transformations have been applied so that each iteration may be executed in parallel on multi-core or SIMD resources. The OpenMP "omp for" (or "omp do") and "omp simd" directives are appropriate for multithreading and vectorizing such loops. Identifying and enabling a program's parallel loops gives an incremental way to parallelize it.
parallel prefix sum A parallel algorithm that computes a prefix sum (also called a scan or cumulative sum). Given the sequence {x(0), x(1), x(2),...}, the corresponding prefix sum is {x(0), x(0)+x(1), x(0)+x(1)+x(2),...}. It can be generalized to apply to binary operations other than addition. For n input elements, the parallel prefix sum can be computed in O(log n) steps.
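An illustrative serial simulation of the log-step (Hillis-Steele) parallel scan; in a real parallel implementation, each pass would update all elements simultaneously:

```python
def hillis_steele_scan(x):
    """Inclusive prefix sum in O(log n) parallel passes (simulated serially).
    In pass d, element i adds the element d positions to its left."""
    x = list(x)
    n = len(x)
    d = 1
    while d < n:
        prev = x[:]  # snapshot: stands in for the simultaneous parallel reads
        for i in range(d, n):
            x[i] = prev[i] + prev[i - d]
        d *= 2
    return x

print(hillis_steele_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

For 4 elements the scan finishes in 2 passes; for a million elements, about 20.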
parallel processing Computing with more than one thread of execution, in order to process different data or perform different functions simultaneously. The threads may be in a single task, or there may be multiple tasks, each with one or more threads.
parallel programming Writing programs that use special languages, libraries, or APIs that enable the code to be run on parallel computers. Parallel programs can be classified by the assumptions they make about the memory architecture: shared memory, distributed memory, or distributed shared memory. For example, OpenMP is an API for shared memory, MPI is an API and library for distributed memory, and both of them can be used together for distributed shared memory.
parallelism An inherent property of a code or algorithm that permits it to be run on parallel hardware. There are several types of parallelism that can be identified in a program, at different scales: coarse-grained, fine-grained, and loop-level. The matching types of parallelism in computer hardware would be: cluster, symmetric multiprocessor (SMP), and vector processor. All types can be exploited at once, if the program is parallelized to run on a cluster of SMPs with vector units.
parallelization Splitting of program execution among many threads or processes that can perform operations simultaneously on different cores of a computer. A program that has been efficiently parallelized will use all of the cores that are available to it nearly all of the time.
particle-mesh method A parallel method for computing far-field forces in a particle system. A regular mesh is superimposed on the system, and far-field forces are computed as follows: (1) each particle is assigned to a nearby mesh point; (2) the mesh problem is solved using a standard parallel algorithm such as FFT or multigrid; (3) the resulting forces are interpolated from the mesh points back to the original particle positions.
partitioned array A data structure representing an array that has been divided into subarrays. In the partitioned array, elements may be stored and indexed by their subarray positions, in order to improve data locality during subarray operations. This is especially helpful if the subarrays are to be the targets of parallel operations.
partitioning Restructuring a program or algorithm in semi-independent computational segments to take advantage of multiple CPUs simultaneously. The goal is to achieve roughly equivalent work in each segment with minimal need for intersegment communication. It is also worthwhile to have fewer segments than CPUs on a dedicated system.
passthrough Taken from signal processing where the term is used to describe logic gates that minimally alter a signal, it is also applied to hypervisors that allow a guest OS to directly (or nearly directly) communicate with the host hardware.
path Tells the OS where to find a file. There are two kinds of paths: full and relative. A full path locates a file using the entire directory structure, starting from the root, so it does not depend on your current location in the directory tree. A relative path locates the file starting from your current directory. The full path to a file is always the same; its relative path depends on where you are.
PE A processing element, especially one that takes part in a parallel computation. It can be a machine (node) on a network, a processor chip, or a core within a processor. A PE has the ability to run an independent process or thread and access the associated memory.
permutation An operation that gives a unique rearrangement of the items in an ordered list, without adding or omitting any items.
petascale computing Computation on the order of a quadrillion (10^15) floating-point calculations per second (at least one petaflop per second) or a computer system capable of computation at this scale.
PETSc Portable, Extensible Toolkit for Scientific Computation (pronounced PETS-cee). A suite of parallelized data structures and routines for the scalable solution of various PDEs that typically occur in scientific applications. Interfaces exist for C, C++, Fortran, and Python.
PGAS Partitioned Global Address Space. A parallel programming model based on the assumptions that (1) all processes or threads have access to a shared, global memory address space that is logically partitioned, and (2) a portion of the global memory is local to each process or thread. Examples of languages that incorporate such a model are UPC, Coarray Fortran, Chapel, and Global Arrays.
PIC method Particle-in-cell method, a technique for solving certain sets of partial differential equations that arise in physics simulations. In the PIC method, representative fluid particles or point masses move through, and are influenced by, a grid on which the continuous physical fields are discretized; these fields can in turn change in response to the particles.
PID Process IDentifier, a unique number that the Linux kernel assigns to each running process.
ping-pong An MPI program that consists of two processes transmitting messages back and forth, swapping the roles of sender and receiver each time.
pipe A way to chain simple commands together; the output of the first command becomes the input for the second command. Represented by "|".
pipelining Decomposing a computation into a sequence of steps which may be executed concurrently, as different data items proceed from step to step to step. Pipelining is often used to increase the utilization of functional units in the processor, leading to greater computational throughput. Common microprocessors have between four and twenty pipeline stages in their floating-point functional units, for example. On NVIDIA GPUs, the CUDA Streams interface enables high-level pipelining.
pipes and filters In software architecture, a style in which the software components ("filters") are arranged so that each component reads one or more streams of data on its inputs and produces one or more streams of data on its outputs (Shaw and Garlan, 1996). Thus, data are imagined to pass through "pipes" connecting the filters. Standard Linux shell commands are an example of this style.
pivoting In Gaussian elimination, transposing the rows of a matrix in order to improve numerical stability in achieving the final result.
point-to-point communication Pairwise communication between two nodes in a computer network, usually implemented as a "send" on one side and a "receive" on the other.
pointer A data type for a variable whose value is an address pointing to another value stored elsewhere in memory.
portability The usability of a software program across different platforms or systems.
POSIX Portable Operating System Interface for *X systems (Unix, Linux). It ensures software compatibility between variants of Unix and related operating systems. It defines command line shells, utility interfaces, and the underlying API for the OS.
post-order traversal A depth-first search of a tree starting from the leaves, so that the traversal of a parent node takes place only after all its child nodes have been traversed.
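A small illustrative Python sketch (the tuple-based tree representation of (value, children) is invented for this example):

```python
def post_order(tree):
    """Post-order traversal: visit all children before their parent."""
    value, children = tree
    result = []
    for child in children:
        result.extend(post_order(child))
    result.append(value)  # the parent is visited last
    return result

#        A
#       / \
#      B   C
#     / \
#    D   E
tree = ("A", [("B", [("D", []), ("E", [])]), ("C", [])])
print(post_order(tree))  # ['D', 'E', 'B', 'C', 'A']
```

Post-order is the natural order for operations such as freeing a tree or evaluating an expression tree, where each node needs its children's results first.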
power Energy per unit time, usually measured in watts (W). Computers require electric power to run, and since some of the power is dissipated as heat, they also require cooling. Cooling takes power as well, e.g., to run fan motors. Power and cooling requirements often set practical limits on the size and/or speed of a computing system.
pragma A directive instructing the C/C++ compiler to use pragmatic or implementation-dependent features at the given line in the source code. The line begins with #pragma, so it appears to be just a comment if the particular feature is unavailable to the compiler. This mechanism is used by OpenMP and OpenACC, among others.
PRAM (computer science) Parallel Random-Access Machine, an idealized shared-memory machine. It is used by parallel-algorithm designers to model an algorithm's parallel performance. For the sake of simplicity, the PRAM model assumes a perfect SMP machine and neglects the cost of synchronization and communication. Sometimes called Parallel Random Access Memory.
PRAM (memory) Phase-change Random Access Memory, a type of non-volatile RAM. It relies on the properties of chalcogenide glass, which is stable in either its amorphous or crystalline phase depending on its temperature prior to cooling. The temperature can be controlled by passing an electric current through a heating element.
prefetching Prefetching occurs when the processor requests a cache line from RAM before needing to use the data stored in the cache line. The data are stored in cache after the request, and can be accessed quickly later.
private variable In shared-memory parallel programming, a variable whose scope is local to each thread, so it is not accessible to other threads. In OpenMP, private variables are identified in a private() clause. In contrast, shared variables are accessible to all threads.
process A task executing on a given processor at a given time. The operating system treats each process as an independent entity and schedules it to run on system resources. Each process maintains its own virtual address space, which the OS maps into physical memory.
processor The part of the computer that actually executes instructions. Commonly it refers to a single physical chip in a node. That chip may contain multiple cores or CPUs, each of which can execute operations independently.
processor-DRAM gap The observation that processor speed has historically increased at a much faster rate than the speed of dynamic RAM, which has given rise to a performance bottleneck due to slow memory accesses. Also called the "memory wall". It has led to the development of a sophisticated hierarchy of in-processor cache memory.
producer-consumer model A common pattern in concurrency in which one thread produces results while a second thread consumes results. Synchronization is needed to ensure that thread A doesn't try to add data to the shared buffer if it's full, and thread B doesn't try to remove data from the shared buffer if it's empty.
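A minimal Python sketch using `queue.Queue`, whose blocking `put`/`get` provide exactly this synchronization (the sentinel-based shutdown is one common convention):

```python
import queue
import threading

buf = queue.Queue(maxsize=4)   # bounded shared buffer; Queue handles the locking
results = []

def producer():
    for i in range(10):
        buf.put(i)     # blocks if the buffer is full
    buf.put(None)      # sentinel: tell the consumer to stop

def consumer():
    while True:
        item = buf.get()  # blocks if the buffer is empty
        if item is None:
            break
        results.append(item * item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```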
profiling Determining where a program spends its time when it is executing. Some programs have a lot of conditionally executed code, so it is important to provide a program with one (or more) typical inputs to get useful profile data. Profilers work in one of two ways. (1) They can produce a call graph by making a record every time a function call is made. This shows who called whom and for how long. (2) They can record the call stack at regular intervals. This method is less intrusive than the other, so it gives more accurate overall results but the details are limited by the statistical nature of the data.
Pthreads The POSIX standard for threads. Pthreads defines an API consisting of a set of types, functions and constants in the C programming language. A Pthreads implementation is expected to have a pthread.h header and a thread library.
public-key encryption Any cryptographic system that uses pairs of keys: a public key which may be disseminated freely, together with a private key known only to the owner. Such systems use an algorithm like DSA or RSA to encrypt and decrypt data. Applications such as PGP and SSH use public-key encryption to provide privacy and to secure network connections.
puppeteer pattern A software architectural pattern in which a controlling program encapsulates and coordinates references to each of several submodels, or "puppets". It does so by delegating operations to the puppets and passing return values from the methods of one as arguments to the methods of another. The submodels can be quite complex, as in a multiphysics simulation.
PVM Parallel Virtual Machine. Software that allows a collection of heterogeneous computers, connected through a common network, to be used as if they were a single distributed parallel system. PVM consists of an API, library, and run-time environment for enabling message-passing. Since the mid-1990s, it has largely been superseded by MPI.
pwd An acronym for Print Working Directory, which displays your present working directory.
quadtree A tree data structure or branching graph in which each internal (non-leaf) node has exactly four children.
quality attributes Non-functional requirements that provide the means for measuring the fitness and suitability of a product. Examples of quality attributes include stability and portability.
queue An ordered list of jobs submitted to a job scheduler. A job scheduler may maintain multiple separate queues corresponding to distinct sets of resources.
queuing locks A locking scheme for processors waiting in a queue that reduces contention compared to sharing a single lock. Each processor in the queue spins on a different memory location until it acquires the lock by detecting a change in the memory state at its spin location. Once the active processor has done its work, it passes the lock by resetting the memory state in the next processor’s spin location.
race condition Also called "data race". In shared-memory systems, a situation that arises when multiple concurrent writes (or overlapping reads and writes) try to access the same memory location. The final result is non-deterministic as it depends on the scheduling of thread operations. A race condition is very hard to debug, so it is best to impose order in a problematic code section by turning it into a one-thread-at-time "critical section", using synchronization constructs.
RAID Redundant Array of Independent Disks. A storage technology in which multiple disk drives are combined into a single logical unit. Stored data files are distributed ("striped") across the component disk drives in order to achieve greater parallel performance, data protection, or both.
rank A unique number in the range 0 to (N-1) that is assigned to each process in an N-way parallel run using MPI.
RDBMS Relational Database Management System. A software system which manages relational databases; well-known commercial examples include Oracle, Microsoft SQL Server, IBM's DB2 and Sybase. Open-source RDBMSs include MySQL, PostgreSQL, SQLite and MariaDB.
RDMA Remote Direct Memory Access. Access to the memory of one computer directly from a process running on another computer, without involving the operating system of either. It permits high-throughput, low-latency networking.
reduction An operation that takes data from all processes in a group, performs a specified operation on the data (such as summing), and stores the results on one process. Reduction is often useful at the end of a large distributed calculation, where all processes operate on part of the data before combining the partial results into an overall result.
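An illustrative serial sketch of the pattern (the per-process chunks are invented; in MPI this combining step would be done by MPI_Reduce with MPI_SUM):

```python
from functools import reduce
import operator

# Each "process" holds one chunk of the distributed data.
per_process_data = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Step 1: each process computes a partial result on its local chunk.
partials = [sum(chunk) for chunk in per_process_data]

# Step 2: the partial results are combined (reduced) into one total,
# which in MPI would land on a designated "root" process.
total = reduce(operator.add, partials)
print(total)  # 36
```

Because addition is associative, the combining step can itself be done in parallel, in O(log N) steps for N processes.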
register A small quantity of storage available as part of a CPU or other digital processor. Almost all computers load data from a larger memory into registers where it is used for arithmetic, manipulated, or tested, by some machine instruction. Usually, any altered data are stored back into main memory. Registers are at the top of the memory hierarchy and provide the fastest way to access data.
register renaming Assigning a different register to hold an operand or a result, in order to avoid "data hazards" such as a read-after-write dependency, which would otherwise inhibit the parallel execution of instructions. Register renaming can be done in hardware, especially if there are more physical registers than there are named registers in the instruction set.
regular expression A text-based mechanism for specifying a pattern (with wildcards, etc.), so that it can be determined if some given text matches the desired pattern. The concept of regular expressions was first popularized by some of the utilities that were available in the Unix OS. A common abbreviation is "regex".
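A short Python illustration (the log-line pattern is invented for this example):

```python
import re

# Pattern: a timestamp like 09:15:42 at the start of a log line,
# followed by the rest of the message.
pattern = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s+(.*)$")

m = pattern.match("09:15:42 job 1234 started")
print(m.group(1))  # 09
print(m.group(4))  # job 1234 started
print(pattern.match("no timestamp here"))  # None -- no match
```

The same regex syntax (with minor dialect differences) is understood by grep, sed, awk, and most programming languages.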
relaxed consistency A characteristic of most modern microprocessors, in which the out-of-order execution of low-level memory operations is permitted. The hardware is allowed to reorder certain load and store operations, provided that the order of memory operations in the high-level program is unaffected.
rendezvous protocol A point-to-point communication method that requires handshaking before the message is transferred.
Rock's Law The observation that the cost of a semiconductor chip fabrication plant doubles every four years. It is sometimes called Moore's Second Law, since the increasing cost can be directly linked to the density of transistors.
roofline model A framework for estimating the upper bound on the performance of executed code as a function of its computational intensity (operations count divided by memory traffic). The "roofline" curve consists of two sections, corresponding to two inherent, hardware-specific limitations: a sloped portion set by the memory bandwidth, and a flat portion set by the peak flop rate. Ref. - Williams, S., Waterman, A. and Patterson, D., "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, 52(4), 65-76 (2009).
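A hedged numerical sketch of the model (the peak and bandwidth figures are invented and do not describe any specific machine):

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Roofline model: attainable performance is capped either by memory
    bandwidth (sloped part) or by the peak flop rate (flat part)."""
    return min(peak_gflops, bandwidth_gb_s * intensity)

PEAK = 1000.0  # GFLOP/s (hypothetical)
BW = 100.0     # GB/s    (hypothetical)

print(attainable_gflops(2.0, PEAK, BW))   # 200.0  -> memory-bound
print(attainable_gflops(50.0, PEAK, BW))  # 1000.0 -> compute-bound
# The "ridge point" where the two limits meet: intensity = PEAK / BW
print(PEAK / BW)  # 10.0 flops per byte
```

Code whose computational intensity falls left of the ridge point gains more from improving memory traffic than from adding flops.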
root directory The top level directory in a Linux system. Represented by /. All other directories are under this directory.
round-robin Apportionment according to the round-robin principle, where each participant takes an equal share of something in turn.
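A minimal illustrative sketch of round-robin assignment of work items to workers:

```python
def round_robin_assign(items, n_workers):
    """Deal items to workers in turn, like dealing cards."""
    buckets = [[] for _ in range(n_workers)]
    for i, item in enumerate(items):
        buckets[i % n_workers].append(item)
    return buckets

print(round_robin_assign(list(range(7)), 3))
# [[0, 3, 6], [1, 4], [2, 5]]
```

This is the default behavior of, e.g., static loop scheduling with chunk size 1 in OpenMP.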
SAXPY/DAXPY A subroutine in the Basic Linear Algebra Subprograms (BLAS) that performs a single/double precision multiplication of a scalar times a vector and adds a second vector: ax + y.
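A pure-Python sketch of the operation SAXPY/DAXPY performs (a real code would call the optimized BLAS routine rather than this loop):

```python
def saxpy(a, x, y):
    """Return a*x + y elementwise for vectors x, y and scalar a."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
# [12.0, 24.0, 36.0]
```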
scalability A characteristic of an algorithm or application code denoting that it will remain efficient and practical to use when applied to larger and larger situations.
scalably parallel In reference to a program: having the property that an increase in the number of tasks and cores results in (1) a proportional decrease in execution time (strong scaling), or (2) a proportional increase in the problem size that can be tackled in the same execution time (weak scaling). In reference to a computer architecture: having the property that all nodes are an equal communication "distance" apart, and that sufficient routes exist to avoid bottlenecks.
ScaLAPACK Scalable LAPACK, a subset of LAPACK targeted for distributed memory parallel computers. The current software is SPMD in style and uses MPI for message passing. It assumes a two-dimensional block cyclic decomposition for matrices.
scalar processor Historically, a single-instruction, single-data processor that only operates on one data item at a time. Such processors have become largely obsolete in the current era of multi-core processors with multiple functional units in each CPU.
scatter/gather Terms for two common collective operations in message passing. A scatter operation takes N identified chunks of data from a single dataset on one node and scatters them out to N nodes; a gather operation does the inverse, i.e., it takes N chunks of data from N nodes and gathers them into a single dataset on one node.
scheduler The part of a computer system's software that determines which task is assigned to each system resource at any given time. For example, in a batch system that maintains a queue of jobs waiting to run, the scheduler determines when each job can start based on such criteria as the order in which job requests were submitted and the availability of the system resources needed by each job.
script A computer program that is executed by an interpreter, typically line by line, without needing to be compiled in advance.
segmented scan Modification of a prefix sum (scan) with an equal-sized array of flag bits to demarcate the segment boundaries on which smaller scans should be performed.
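A serial illustrative sketch of the semantics (a parallel implementation would build this from the log-step scan, but the serial version shows what is computed):

```python
def segmented_scan(values, flags):
    """Inclusive prefix sums that restart at every position whose flag
    is 1 (each 1 marks the start of a new segment)."""
    out = []
    running = 0
    for v, f in zip(values, flags):
        running = v if f else running + v
        out.append(running)
    return out

# segments:           |1 2 3|4 5|6|
print(segmented_scan([1, 2, 3, 4, 5, 6],
                     [1, 0, 0, 1, 0, 1]))
# [1, 3, 6, 4, 9, 6]
```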
semaphore A variable or typed object that is used to keep track of the availability of a set of common resources in a parallel environment. Requesting a resource decrements the semaphore, and freeing a resource increments it. A thread/process can gain access to a resource as long as the semaphore is greater than 0, otherwise it must wait.
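A minimal Python sketch using `threading.Semaphore` (the worker logic and the counters tracking concurrency are invented for illustration):

```python
import threading
import time

sem = threading.Semaphore(2)   # at most 2 threads may hold the resource
active = 0
max_seen = 0
state_lock = threading.Lock()

def worker():
    global active, max_seen
    with sem:                  # acquire decrements the count; release increments
        with state_lock:
            active += 1
            max_seen = max(max_seen, active)
        time.sleep(0.005)      # simulate using the resource
        with state_lock:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(1 <= max_seen <= 2)  # True: the semaphore caps concurrent holders at 2
```

A mutex is the special case of a semaphore initialized to 1.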
sequential consistency A property of a multiprocessor, which asserts that the result of executing a set of parallel instructions is the same as if all processors executed the instructions in some sequential order.
serial processing Execution of a program that processes all of its data in a single thread, one operation at a time. Also referred to as "sequential processing."
serialize The act of converting a data structure, or in object-oriented programming, an object, to a format that may be stored in a file so that it may be reused (deserialized) later.
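A minimal serialization round trip, using Python's standard `json` module (the record contents are invented for illustration):

```python
import json

record = {"name": "simulation-42", "steps": 1000, "dt": 0.5}

text = json.dumps(record)        # serialize: object -> storable string
restored = json.loads(text)      # deserialize: string -> object
assert restored == record
```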
Service Units (SUs) The units in which computing resource usage is measured, for the purposes of granting allocations and tracking consumption. The definition of an SU varies from provider to provider, and it generally depends on resource type.
SGEMM/DGEMM A subroutine in the Basic Linear Algebra Subprograms (BLAS) that performs a single/double precision, general, matrix-matrix multiplication.
shared memory Access to the same memory addresses by multiple threads in a process. If one thread changes the data that are shared, special mechanisms must be used to ensure that if a different thread accesses the data, it gets the value that the programmer intends.
shared memory computer A computer in which multiple processors have access to the same global memory in a unified address space. Speed of access can be equal for all (SMP) or nonuniform (NUMA). The two biggest challenges faced by a shared memory architecture are avoiding memory contention and maintaining cache coherence.
shell A command line interface to the kernel. In Linux or Unix, C shell (csh, tcsh) and Bash (bash) are the most common shells.
SIMD Single Instruction Multiple Data. It describes the instructions and/or hardware functional units that enable one operation to be performed on multiple data items simultaneously. In some cases, even better performance can be achieved with hardware that is capable of pipelining. At a higher level, SIMD can mean an architecture in which one "control processor" issues instructions to other processors.
SIMT Single Instruction Multiple Threads, NVIDIA's term for the architecture of their GPUs. Threads are created, managed, scheduled, and executed in groups of 32 parallel threads, called warps. Ref. - Lindholm, E., Nickolls, J., Oberman, S., Montrym, J., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 28(2), 39-55 (2008).
Skylake A line of Intel processors that in 2017 grew to include Xeon Scalable Performance (SP) models used in HPC systems. Skylake SP processors can hold up to 28 cores supporting the AVX-512 instruction set. The cores are interconnected by a 2D mesh similar to what is found in Intel Xeon Phi. Sometimes abbreviated SKX.
SLURM The Slurm Workload Manager (formerly the Simple Linux Utility for Resource Management) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for clusters running Linux and several other UNIX-like operating systems.
SMP Symmetric MultiProcessor. A computer or computing device in which two or more processors are connected to a single shared main memory and can be controlled by a single OS instance.
SMPSs The SMP Superscalar (SMPSs) framework, which tries to parallelize the execution of any functions that have been specially annotated by the programmer. Calls to these annotated functions must be completely independent of one another.
software stack A complete collection of software to accomplish a defined goal or group of goals.
solid state drive (SSD) A device which uses integrated-circuit assemblies as memory for data storage, rather than electromechanical magnetic disks. Compared to traditional hard disk drive (HDD) storage technology, solid-state storage has considerably faster data retrieval, lower latency, better resistance to physical shock, and quieter operation.
sorting An algorithm to put the items of a list or sequence into an order consistent with some specified criterion, for example, numerical or alphabetical order.
sparse linear algebra Linear algebra operations involving vectors and matrices in which nearly all entries are zero. Sparse matrices are stored in various compact formats to avoid wasting memory and to speed up memory accesses. The specific storage format can depend on the type of problem being solved.
SPEC Standard Performance Evaluation Corporation, a group that produces and standardizes benchmark suites that are pertinent to some major classes of applications. For HPC, the best-known suites are SPECint and SPECfp, which are based on actual applications, such as the GNU compiler for the former, and scientific programs like GAMESS (chemistry) and WRF (weather) for the latter. Also relevant to HPC are the benchmarks contained in SPEC MPI and SPEC OMP (OpenMP).
spectral method A technique in which approximate solutions to a continuum model are computed through the use of Fourier transforms. Each Fourier component is like a wave of fixed wavelength, so a set of components is called a "spectrum".
speculative parallelization An approach to parallel computing in which a parallel task computes its results under the tentative assumption that no further inputs will be received; if new inputs do arrive, then the results must be re-computed. Also called optimistic parallelization.
speedup A measure of how much faster a given program runs when executed in parallel on several processor cores as compared to serial execution on a single processor core. Speedup is defined as S = sequential run time / parallel run time.
spinlock A lock that causes a requesting thread to wait in a loop ("spin") and repeatedly check until the lock becomes available. This makes the waiting thread appear to be busy.
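The busy-wait loop can be sketched in Python by repeatedly trying a non-blocking acquire; this is only an illustration of the pattern, since real spinlocks rely on atomic hardware instructions rather than an interpreter loop:

```python
import threading

lock = threading.Lock()

def spin_acquire(lk):
    """Busy-wait ("spin") until the lock is free (illustrative sketch;
    real spinlocks use atomic test-and-set instructions)."""
    while not lk.acquire(blocking=False):
        pass                      # loop and retry: the thread stays busy

counter = 0
def worker():
    global counter
    for _ in range(1000):
        spin_acquire(lock)
        counter += 1              # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 4000            # no increments were lost
```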
split-phase barrier In the UPC programming language, a non-blocking barrier. The barrier is split into two parts, upc_notify followed by upc_wait. No thread passes the upc_wait until all threads have executed upc_notify.
SPMD SPMD is an acronym for Single Program, Multiple Data. All processes run the same program, operating on different data. This model is particularly appropriate for problems with a regular, predictable communication pattern. Codes using this model tend to be scalable if global communication is avoided.
SpMV Sparse Matrix-Vector multiplication. A commonly-used computational kernel.
SQL Structured Query Language, a commonly-used programming language designed for managing and accessing the data held in a relational database.
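A short example using Python's built-in `sqlite3` module; the `jobs` table and its rows are invented for illustration:

```python
import sqlite3

# In-memory SQLite database; table and rows are made up for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (id INTEGER, cores INTEGER)")
con.executemany("INSERT INTO jobs VALUES (?, ?)", [(1, 16), (2, 128), (3, 64)])

# Declarative query: jobs requesting more than 32 cores, largest first.
rows = con.execute(
    "SELECT id, cores FROM jobs WHERE cores > 32 ORDER BY cores DESC"
).fetchall()
assert rows == [(2, 128), (3, 64)]
con.close()
```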
SRAM Static Random-Access Memory. In contrast to dynamic RAM (DRAM), which must be periodically refreshed, SRAM retains its data without refreshing for as long as power is supplied. But it is still volatile in the conventional sense that data is eventually lost when the memory is not powered. In SRAM, one bit of data is stored using 6 or more transistors, whereas DRAM requires 1 transistor plus a capacitor, allowing higher memory density on a single chip.
SSE Streaming SIMD Extensions, the SIMD instruction set expansion for x86 architectures. SSE was originally designed by Intel to support vector-processing capability for single-precision, floating-point data, which is heavily used in DSP (Digital Signal Processing) and computer graphics. SSE has by now advanced through several generations, named SSE2, SSE3, etc. These have added support for double-precision and integer processing.
stack A region of computer memory where a thread adds or removes data in a last-in-first-out manner. A thread's stack is always used to store the location of a function call in order to allow a return statement in the function to return to the correct location. Programmers may also opt to "push" state data for a function onto the top of the stack. Upon function exit, the relevant portion of the stack is "popped" or cleared.
state All the stored information to which a program has access at a given point in time.
stencil In numerical analysis, the logical subgroup of nodes in a mesh that are used to compute a numerical approximation at a point of interest.
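A simple 1D example may help: a 3-point stencil that replaces each interior mesh value with the average of itself and its two neighbors (a toy smoothing step; in 2D the analogous 5-point stencil uses a point and its four mesh neighbors). The function name and data are illustrative:

```python
# 3-point stencil in 1D: each interior value becomes the average of itself
# and its two neighbors; the boundary values are left unchanged.
def smooth(u):
    return ([u[0]]
            + [(u[i-1] + u[i] + u[i+1]) / 3 for i in range(1, len(u) - 1)]
            + [u[-1]])

assert smooth([0, 3, 0, 3, 0]) == [0, 1.0, 2.0, 1.0, 0]
```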
store and forward A networking technique in which a message is sent to an intermediate node to be stored and verified for integrity, before it is forwarded to another intermediate node or to its final destination.
stream Input and output channels of Linux programs. Three standard streams are defined: STDIN (standard input, normally read from the keyboard), STDOUT (standard output, normally sent to the screen), and STDERR (error output, also sent to the screen). STDIN can be redirected to come from a file with the < operator, while STDOUT and STDERR can be redirected to a file with the > and >> operators.
Streaming Multiprocessor (SM) In NVIDIA's GPU architecture, a hardware element that is analogous to a CPU core. An NVIDIA Tesla V100 GPU has 80 SMs, each containing multiple sets of 32 "CUDA cores" (64/64/32 for int/float/double precision).
strength reduction A code optimization technique in which a costly operation is replaced with an equivalent, less costly one; for example, replacing an integer exponentiation with a series of multiplications.
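The exponentiation example can be made concrete with a small Python sketch (function names are hypothetical); three squarings replace one general power operation:

```python
# Strength reduction by hand: x**8 invokes a general exponentiation, but
# three squarings compute the same value using only cheap multiplies.
def pow8_slow(x):
    return x ** 8

def pow8_fast(x):
    x2 = x * x        # x^2
    x4 = x2 * x2      # x^4
    return x4 * x4    # x^8

assert pow8_fast(3) == pow8_slow(3) == 6561
```

In practice an optimizing compiler applies this transformation automatically, for example turning multiplications by a loop index into repeated additions.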
stride The number of data locations between successive accesses to a vector, array, or other contiguous data structure. For performance reasons, stride-one or unit stride is very desirable, because fetching a cache line will bring in multiple useable entries (i.e., accesses are dense).
string A sequence of characters, stored either as a literal constant or as some kind of variable.
superlinear speedup Parallel speedup that is greater than N for N CPUs (or N cores), so it exceeds the best speedup that would normally be expected. Typically such speedup occurs when the amount of memory needed by each task becomes small enough to fit within the cache of a single CPU, which dramatically shrinks the memory access time.
superscalar execution A type of instruction-level parallelism in which the CPU simultaneously dispatches instructions to multiple functional units, with the result that more than one instruction can be executed during a clock cycle. A functional unit is not the same as a core, but is rather an execution resource within a core, such as an arithmetic logic unit or a floating point unit.
support vector machine (SVM) A machine learning model that represents the training data as points in a high-dimensional space and constructs one or more hyperplanes to divide the space linearly for classification purposes.
swap space A disk partition or file that is used to store pages (typically 4 KB blocks) of virtual memory that do not fit into the main memory or RAM of the computer.
synchronization The act of bringing two or more processes to known points in their execution at the same clock time. Explicit synchronization is often necessary in SPMD and MIMD programs. The time wasted by processes waiting for other processes to synchronize with them can be a major source of inefficiency in parallel programs.
synchronization (multithreading) The use of a lock or similar mutex mechanism to ensure that no two concurrently-executing threads or processes execute some critical section of a program at the same time. Threads must take turns executing this serialized portion of the program, i.e., no other thread may execute this portion until the current thread is finished with it.
task A single instance of an executable in a parallel program. A common way to parallelize a program is to divide up its work among many tasks and to assign tasks to nodes in a cluster computer. The threads in each task have access to a virtual memory address space that is unique to the task, so shared-memory programming can be used within a task.
task dependency A scheduling constraint that arises when a task that could potentially be done in parallel with other tasks cannot start until one (or more) of the other tasks completes.
task graph A representation of a computer program as a directed graph in which the vertices are tasks and the directed edges are dependencies between the tasks. Parallelization of the program is equivalent to partitioning the graph.
TBB Threading Building Blocks. Intel's C++ template library for creating programs that take advantage of multi-core processors.
thread A portion of a process (running program) that is executing a sequence of instructions. It shares a virtual address space with other threads in the same process, but it has its own instruction pointer and a separate stack. Threads run independently of each other, so if synchronization between them is required, it must be done explicitly. On multicore processors, the operating system will schedule threads on all available resources, provided there are enough threads ready to run. If there are more threads than resources, the operating system will multiplex the threads, i.e., it will let one run for a while and then let a different one run if it is ready and waiting.
thread safety A property that ensures that a piece of code or a function will work correctly when it is run by multiple threads at once.
thread-block A group of threads that operates collectively on an array that is being processed by a GPU. The operation is defined by a CUDA kernel function. All threads in the block are processed on the same GPU SM (streaming multiprocessor).
throughput The rate at which data are processed or solutions to a given problem are computed by a computing system.
Thrust (GPU) NVIDIA's high-level parallel computing library for CUDA. It mimics the C++ Standard Template Library (STL).
tiling An optimization technique for certain matrix operations; also called blocking. If an NxN matrix is subdivided into nxn blocks or tiles of size bxb, so that one bxb block fits into some fraction of cache, then data re-use is greatly enhanced.
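A pure-Python sketch of a tiled matrix multiplication (the function and loop structure are illustrative; production code would use an optimized BLAS routine, where the same blocking idea is applied to real cache sizes):

```python
def matmul_tiled(A, B, n, b):
    """Multiply two n x n matrices (lists of lists) in b x b tiles.
    Each small tile of A, B, and C is reused while it is still "hot",
    which is what improves cache behavior on real hardware."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity
B = [[i * n + j for j in range(n)] for i in range(n)]
assert matmul_tiled(A, B, n, 2) == B                            # I @ B == B
```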
TLB Translation Lookaside Buffer, a cache of recent virtual-to-physical address translations that the memory management hardware uses to speed up virtual address translation. All current desktop, notebook, and server processors use a TLB to map virtual and physical address spaces, and it is nearly always present in any hardware which utilizes virtual memory. A TLB miss results in a slow search of the entire page table.
transistors The basic building blocks of an integrated circuit in a computer chip. Each transistor is a tiny semiconductor device that acts as a switch or logic gate. At least one transistor is needed to store one data bit; more are needed if the memory is non-volatile.
tree decomposition A domain decomposition technique used for adaptive grid refinement in simulated systems where the data points tend to cluster or to form steep gradients. The natural data structure for such a system is a tree in which a coarse-level node can have 2^n child nodes at a more refined level, where n is the dimensionality of the simulated system. Thus, in a 1, 2, or 3 dimensional system, the tree data structure would be a binary tree, quadtree, or octree respectively.
tridiagonal matrix Matrix whose entries are zero in all locations except on three diagonals: the main diagonal and the diagonals immediately above and below it.
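Tridiagonal systems can be solved in O(n) time by the Thomas algorithm, a specialized form of Gaussian elimination; a minimal Python sketch follows (the function name and the convention that a[0] and c[n-1] are unused are assumptions of this sketch):

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system A x = d, where a, b, c hold the sub-,
    main, and super-diagonal of A (a[0] and c[-1] are unused).
    O(n): one forward elimination sweep plus back substitution."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]          # eliminate the sub-diagonal
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 2x2 check: [[2,1],[1,2]] x = [3,3] has solution x = [1,1].
x = thomas([0.0, 1.0], [2.0, 2.0], [1.0, 0.0], [3.0, 3.0])
assert all(abs(v - 1.0) < 1e-12 for v in x)
```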
unit testing The process of verifying that all the individual parts of a computer code are working correctly, by doing separate, isolated tests with driver programs that provide valid input data and check for valid output.
UPC Unified Parallel C, an extension of C designed for large-scale parallel machines that have both shared and distributed memory (e.g., a cluster of SMPs). Typically, a single thread runs on each CPU core. UPC is a PGAS language.
variable In computer programming, a storage location and an associated symbolic name which contains some known or unknown quantity or information.
vector lane In a vector processor, a vector functional unit that can be assigned to one thread. The set of all vector lanes in the processor can be assigned to one thread or to multiple threads as needed, based on the SIMD (vector) workload.
vector processors Functional units within a CPU or GPU that can operate on several data items simultaneously. Vector operations are typically initiated through a special SIMD instruction set, e.g., Intel’s SSE. Also called vector processing units or VPUs.
vectorization A type of parallelism in which specialized vector hardware units, or vector processing units (VPUs), perform numerical operations concurrently on fixed-size arrays, rather than on single elements. See SIMD.
verification and validation (V&V) The process of checking that a software system meets its requirements and specifications (verification) and that it fulfills its intended purpose (validation).
victim cache A cache used to hold blocks evicted from a CPU cache upon replacement. The victim cache lies between the main cache and its refill path. Since it only contains blocks that were evicted from the main cache, it promotes faster re-accesses of data.
virtual machine An operating system and associated virtual hardware components that are implemented in software and run on a hypervisor instead of directly communicating with hardware.
virtual memory A memory-management technique that virtualizes a computer's various forms of computer data storage (such as RAM and disk storage). This allows a program to be written as though there is only one kind of memory, "virtual" memory, which behaves like contiguous, directly-addressable read/write memory. The basic unit of memory in the VM is called a page, and its size is often 4KB.
visualization In the broadest sense, visualization is the art or science of transforming information to a form "comprehensible" by the sense of sight. Visualization is broadly associated with graphical display in the form of pictures (printed or photo), workstation displays, or video.
Viterbi algorithm A dynamic programming algorithm for determining the most likely sequence of unobserved states that results in a sequence of observed events. Its most common application is in hidden Markov models (HMMs). The algorithm moves forward through the observations, applying a recurrence relation that tracks the most probable state sequence ending in each state, then traces backwards from the most probable final state to recover the full sequence.
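A compact Python implementation on a toy two-state weather HMM (all probabilities are invented for illustration, following the commonly-used rainy/sunny textbook example):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for the observations obs.
    Forward pass: best probability of reaching each state at each step.
    Backward pass (traceback): recover the winning path."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prob, prev = max(
                (V[t-1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s], back[t][s] = prob, prev
    state = max(states, key=lambda s: V[-1][s])   # best final state
    path = [state]
    for t in range(len(obs) - 1, 0, -1):          # traceback
        state = back[t][state]
        path.append(state)
    return path[::-1]

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
path = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
assert path == ["Sunny", "Rainy", "Rainy"]
```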
VLIW Very Long Instruction Word, a processor architecture designed to increase instruction-level parallelism at the CPU level by issuing long instructions that comprise multiple operations, one operation for each execution unit in the CPU.
warp In NVIDIA GPU architecture, a group of threads that can be scheduled as a unit, with a typical size of 32. All threads in a warp run on the same streaming multiprocessor (SM).
word The memory required to store one number. For floating-point numbers, a single-precision word is typically 4 bytes. A double-precision word (often called a doubleword) is 8 bytes.
workflow A predefined sequence of operations; especially, a series of high-level computational steps that is mediated, controlled, and carried out by a software system.
workpile pattern A parallel programming model in which the work consists of an unordered pile of tasks, and the computation is carried out by a set of workers that take one task at a time from the workpile, do the work, and return for another task until all are completed. Most often the workpile is ordered arbitrarily and represented as a queue. The workpile pattern achieves dynamic load balance with good scalability, assuming there is no communication between tasks. Also known as "bag of tasks", "task farming", or "thread pool pattern". When the workpile and the results are managed by a distinct process, it becomes the well-known "manager-worker" pattern.
wormhole routing A switching technique that breaks large packets into small flow-control digits, or flits. The first flit of a packet contains routing information that is used to set up the routing behavior for all subsequent flits of that packet.
Xeon Phi Intel's product line of processors and coprocessors based on the Many Integrated Core (MIC) architecture. In addition to many cores, these products possess extra capabilities for vector processing, large shared L2 caches, and fast local RAM within the card or package.
©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement