So the target architecture is the Stampede2 cluster, which hosts 4,200 KNL compute nodes. Each compute node has 68 cores with 4 hardware threads each, so that's a total of 272 hardware threads (logical processors) per node. This particular application was run on the KNLs configured in cache-quadrant mode, in which the fast MCDRAM is managed as a Level 3 cache and the tiles (there are two cores per tile) are arranged in quadrants, with addresses hashed to a directory that is in the same quadrant as the memory.

Now, the KNLs have a clock rate of 1.4 GHz, which is roughly half that of, say, Xeon processors like Sandy Bridge. So they're a little slower, but there are a lot more cores per node. There are 16 GB of high-speed MCDRAM, so that translates into a pretty large Level 3 cache. Each core has a 32 KB L1 cache; the 1 MB L2 cache, however, is shared between the two cores in a tile.

So if we look a little closer at the Knights Landing node, you can see these 36 tiles connected by a 2D mesh interconnect. Each tile, as I mentioned, has two cores, and each core has two vector processing units and four hardware threads, with the L2 cache shared between the cores on the tile. The vector processing units are 512 bits wide, so that translates to 16 single-precision or 8 double-precision numbers per unit, which is 32 single-precision or 16 double-precision numbers per core across its two units. And they're compatible with a number of vector instruction sets, including backward compatibility with AVX, AVX2, and SSE.
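
As a rough sanity check on those numbers, here is a minimal Python sketch that derives the per-node thread count, the number of tiles implied by 68 cores, and the vector lane counts from the figures quoted in the talk. The peak-throughput line at the end is not from the talk: it assumes each vector unit completes one fused multiply-add per cycle at the nominal 1.4 GHz clock, counted as two floating-point operations, so treat it as an upper bound under that assumption rather than a measured figure. All names in the code are illustrative.

```python
# Figures quoted in the talk.
CORES_PER_NODE = 68
THREADS_PER_CORE = 4
CORES_PER_TILE = 2
VPUS_PER_CORE = 2
VPU_WIDTH_BITS = 512
CLOCK_GHZ = 1.4


def lanes(bits_per_element: int) -> int:
    """Number of elements of the given width that fit in one vector register."""
    return VPU_WIDTH_BITS // bits_per_element


# Hardware threads (logical processors) per node.
hw_threads_per_node = CORES_PER_NODE * THREADS_PER_CORE

# Tiles implied by 68 cores at two cores per tile. This is fewer than the
# 36 tiles mentioned for the node diagram (36 tiles would hold 72 cores).
active_tiles = CORES_PER_NODE // CORES_PER_TILE

# Vector lanes per unit and per core (two units per core).
sp_lanes_per_vpu = lanes(32)   # single precision, 32-bit elements
dp_lanes_per_vpu = lanes(64)   # double precision, 64-bit elements
sp_lanes_per_core = sp_lanes_per_vpu * VPUS_PER_CORE
dp_lanes_per_core = dp_lanes_per_vpu * VPUS_PER_CORE

# ASSUMPTION (not from the talk): one fused multiply-add per vector unit
# per cycle, counted as 2 floating-point operations, at the nominal clock.
FLOPS_PER_FMA = 2
peak_dp_gflops_per_node = (
    CORES_PER_NODE * VPUS_PER_CORE * dp_lanes_per_vpu * FLOPS_PER_FMA * CLOCK_GHZ
)

if __name__ == "__main__":
    print(f"hardware threads per node: {hw_threads_per_node}")
    print(f"tiles implied by {CORES_PER_NODE} cores: {active_tiles}")
    print(f"SP lanes per unit / per core: {sp_lanes_per_vpu} / {sp_lanes_per_core}")
    print(f"DP lanes per unit / per core: {dp_lanes_per_vpu} / {dp_lanes_per_core}")
    print(f"nominal peak DP GFLOP/s per node (assumed FMA): {peak_dp_gflops_per_node:.1f}")
```

The lane arithmetic is the useful part here: a 512-bit register holds 16 floats or 8 doubles, and the 32 and 16 figures only appear once you count both vector units in a core.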