
Using source_node


An active source_node starts sending messages as soon as an edge is connected to it. If not properly managed, this can lead to dropped messages. A source_node is constructed in the active state by default; passing false as the last constructor argument creates it in the inactive state:

template< typename Body > source_node( graph &g, Body body, bool is_active=true )

To activate an inactive source_node, you call the node's function activate:

    source_node< int > src( g, src_body(10), false );
    // use it in calls to make_edge…
    src.activate();

To manage this, either construct all source_nodes in the inactive state and activate them only after the entire flow graph is constructed, or take care to build the graph in an order that cannot drop messages.

For example, consider the code from the Data Flow Graph example. In that implementation, the source_node is constructed in the inactive state and activated after all other edges are made:

      make_edge( squarer, summer );
      make_edge( cuber, summer );
      source_node< int > src( g, src_body(10), false );
      make_edge( src, squarer );
      make_edge( src, cuber );
      src.activate();
      g.wait_for_all();

In this example, if the source_node were constructed in the active state, it might send a message to squarer immediately after the edge to squarer is connected. Later, when the edge to cuber is connected, cuber will receive all future messages, but may have already missed some.

In general it is safest to create your source_nodes in the inactive state and then activate them after the whole graph is constructed. However, this approach serializes graph construction and graph execution.

Some graphs can be constructed safely with source_nodes active, allowing the overlap of construction and execution. If your graph is a directed acyclic graph (DAG), and each source_node has only one successor, you can construct your source_nodes in the active state if you construct the edges in reverse topological order; that is, make the edges at the largest depth in the tree first, and work back to the shallowest edges. For example, if src is a source_node and func1 and func2 are both function nodes, the following graph would not drop messages, even though src is constructed in the active state:

    const int limit = 10;
    int count = 0;
    graph g;
    source_node<int> src( g, [&]( int &v ) -> bool {
      if ( count < limit ) {
        ++count;
        v = count;
        return true;
      } else {
        return false;
      }
    } );
    function_node<int,int> func1( g, 1, []( int i ) -> int {
      cout << i << "\n";
      return i;
    } );
    function_node<int,int> func2( g, 1, []( int i ) -> int {
      cout << i << "\n";
      return i;
    } );

    make_edge( func1, func2 );
    make_edge( src, func1 );

    g.wait_for_all();

The above code is safe because the edge from func1 to func2 is made before the edge from src to func1. If the edge from src to func1 were made first, func1 might generate a message before func2 is attached to it; that message would be dropped. Also, src has only a single successor. If src had more than one successor, the successor that is attached first might receive messages that do not reach the successors that are attached after it.


Nodes


A node is a class that inherits from tbb::flow::graph_node and also typically inherits from tbb::flow::sender<T>, tbb::flow::receiver<T>, or both. A node performs some operation, usually on an incoming message, and may generate zero or more output messages. Some nodes require more than one input message or generate more than one output message.

While it is possible to define your own node types by inheriting from graph_node, sender and receiver, it is more typical that predefined node types are used to construct a graph. The list of predefined nodes is available from the See Also section below.

A function_node is a predefined type available in flow_graph.h and represents a simple function with one input and one output. The constructor for a function_node takes three arguments:

template< typename Body> function_node(graph &g, size_t concurrency, Body body)
Parameter / Description
Body

Type of the body object.

g

The graph the node belongs to.

concurrency

The concurrency limit for the node. You can use the concurrency limit to control how many invocations of the node are allowed to proceed concurrently, from 1 (serial) to an unlimited number.

body

User defined function object, or lambda expression, that is applied to the incoming message to generate the outgoing message.

Below is code for creating a simple graph that contains a single function_node. In this example, a node n is constructed that belongs to graph g, and has a second argument of 1, which allows at most 1 invocation of the node to occur concurrently. The body is a lambda expression that prints each value v that it receives, spins for v seconds, prints the value again, and then returns v unmodified. The code for the function spin_for is not provided.

    graph g;
    function_node< int, int > n( g, 1, []( int v ) -> int {
        cout << v;
        spin_for( v );
        cout << v;
        return v;
    } );

After the node is constructed in the example above, you can pass messages to it, either by connecting it to other nodes using edges or by invoking its function try_put. Using edges is described in the next section.

    n.try_put( 1 );
    n.try_put( 2 );
    n.try_put( 3 );

You can then wait for the messages to be processed by calling wait_for_all on the graph object:

    g.wait_for_all();

In the above example code, the function_node n was created with a concurrency limit of 1. When it receives the message sequence 1, 2 and 3, the node n will spawn a task to apply the body to the first input, 1. When that task is complete, it will then spawn another task to apply the body to 2. And likewise, the node will wait for that task to complete before spawning a third task to apply the body to 3. The calls to try_put do not block until a task is spawned; if a node cannot immediately spawn a task to process the message, the message will be buffered in the node. When it is legal, based on concurrency limits, a task will be spawned to process the next buffered message.

In the above graph, each message is processed sequentially. If, however, you construct the node with a different concurrency limit, parallelism can be achieved:

    function_node< int, int > n( g, tbb::flow::unlimited, []( int v ) -> int {
        cout << v;
        spin_for( v );
        cout << v;
        return v;
    } );

You can use unlimited as the concurrency limit to instruct the library to spawn a task as soon as a message arrives, regardless of how many other tasks have been spawned. You can also use any specific value, such as 4 or 8, to limit concurrency to at most 4 or 8, respectively. It is important to remember that spawning a task does not mean creating a thread. So while a graph may spawn many tasks, only the number of threads available in the library's thread pool will be used to execute these tasks.

Suppose you use unlimited in the function_node constructor instead and call try_put on the node:

    n.try_put( 1 );
    n.try_put( 2 );
    n.try_put( 3 );
    g.wait_for_all();

The library spawns three tasks, each one applying n's lambda expression to one of the messages. If you have a sufficient number of threads available on your system, then all three invocations of the body will occur in parallel. If, however, you have only one thread in the system, they execute sequentially.
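
Putting these pieces together, a minimal self-contained sketch of the unlimited-concurrency version might look like the following. The body of spin_for is only a stand-in here; any function that consumes roughly v seconds would do.

#include "tbb/flow_graph.h"
#include <iostream>

using namespace tbb::flow;

// Stand-in for the timing function used in the snippets above.
static void spin_for( int /*seconds*/ ) { /* busy-wait or sleep */ }

int main() {
    graph g;
    function_node< int, int > n( g, unlimited, []( int v ) -> int {
        std::cout << v;
        spin_for( v );
        std::cout << v;
        return v;
    } );
    n.try_put( 1 );
    n.try_put( 2 );
    n.try_put( 3 );
    g.wait_for_all();   // returns once all three body tasks have completed
    return 0;
}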


Bandwidth and Cache Affinity


For a sufficiently simple function Foo, the examples might not show good speedup when written as parallel loops. The cause could be insufficient system bandwidth between the processors and memory. In that case, you may have to rethink your algorithm to take better advantage of cache. Restructuring to better utilize the cache usually benefits the parallel program as well as the serial program.

An alternative to restructuring that works in some cases is affinity_partitioner. It not only automatically chooses the grainsize, but also optimizes for cache affinity and tries to distribute the data uniformly among threads. Using affinity_partitioner can significantly improve performance when:

  • The computation does a few operations per data access.

  • The data acted upon by the loop fits in cache.

  • The loop, or a similar loop, is re-executed over the same data.

  • There are more than two hardware threads available (and especially if the number of threads is not a power of two). If only two threads are available, the default scheduling in Intel® Threading Building Blocks (Intel® TBB) usually provides sufficient cache affinity.

The following code shows how to use affinity_partitioner.

#include "tbb/tbb.h"

void ParallelApplyFoo( float a[], size_t n ) {
    static affinity_partitioner ap;
    parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), ap);
}

void TimeStepFoo( float a[], size_t n, int steps ) {
    for( int t=0; t<steps; ++t )
        ParallelApplyFoo( a, n );
}

In the example, the affinity_partitioner object ap lives between loop iterations. It remembers where iterations of the loop ran, so that each iteration can be handed to the same thread that executed it before. The example code gets the lifetime of the partitioner right by declaring the affinity_partitioner as a local static object. Another approach would be to declare it at a scope outside the iterative loop in TimeStepFoo, and hand it down the call chain to parallel_for.
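
A minimal sketch of that alternative is shown below; it assumes, as before, that ApplyFoo is the body class from the earlier parallel_for example and that the tbb namespace is in scope.

void ParallelApplyFoo( float a[], size_t n, affinity_partitioner& ap ) {
    // The same partitioner object is reused on every call, so the scheduler
    // can replay the previous iteration-to-thread mapping.
    parallel_for( blocked_range<size_t>(0,n), ApplyFoo(a), ap );
}

void TimeStepFoo( float a[], size_t n, int steps ) {
    affinity_partitioner ap;   // lives across all time steps
    for( int t=0; t<steps; ++t )
        ParallelApplyFoo( a, n, ap );
}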

If the data does not fit across the system’s caches, there may be little benefit. The following figure illustrates the possible situations.

Benefit of Affinity Determined by Relative Size of Data Set and Cache

The next figure shows how parallel speedup might vary with the size of a data set. The computation for the example is A[i]+=B[i] for i in the range [0,N). It was chosen for dramatic effect. You are unlikely to see quite this much variation in your code. The graph shows little improvement at the extremes. For small N, parallel scheduling overhead dominates, resulting in little speedup. For large N, the data set is too large to be carried in cache between loop invocations. The peak in the middle is the sweet spot for affinity. Hence affinity_partitioner should be considered a tool, not a cure-all, when there is a low ratio of computations to memory accesses.

Improvement from Affinity Dependent on Array Size

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



Intel® Parallel Studio XE 2016 Beta program has started!


The Intel® Parallel Studio XE 2016 Beta program is now available!

In this beta test, you will have early access to Intel® Parallel Studio XE 2016 products and the opportunity to provide feedback to help make our products better. Registration is easy through the pre-Beta survey site.

This suite of products brings together exciting new technologies along with improvements to Intel’s existing software development tools:

  • Expanded Standards and Features – Scaling Development Efforts Forward
    Additional language support for C11 and C++14, Fortran 2008 Submodules and IMPURE ELEMENTAL, C Interoperability from Fortran 2015, and OpenMP* 4.1 TR3. New support for SIMD operator use with SSE integer types, Intel® Cilk™ Plus combined Parallel and SIMD loops, OpenMP* 4.0 user-defined reductions (C++ only), enhanced uninitialized variable detection (Fortran only), feature improvements to Intel’s Language Extensions for Offload, annotated source listings, and a new directory structure. All available in the Intel® C/C++ and Fortran Compiler 16.0 Beta.
  • Vectorization – Boost Performance by Utilizing Vector Instructions/Units
    Vectorization Advisor identifies new vectorization opportunities as well as improvements to existing vectorization and highlights them in your code. It makes actionable coding recommendations to boost performance and estimates the speedup. Available in the new Intel® Advisor XE 2016 Beta!
  • Big Data Analytics – Easily Build IA-Optimized Data Analytics Applications
    Intel® Data Analytics Acceleration Library (DAAL) 2016 Beta will help data scientists speed through big data challenges with optimized IA functions.
  • The Intel® Math Kernel Library 11.3 Beta introduces the Inspector-Executor API for Sparse BLAS, a new two-stage API for sparse matrix-vector multiplication, as well as MPI wrappers to support custom MPI implementations.
  • The Intel® Integrated Performance Primitives 9.0 Beta adds new APIs to support external threading – a feature that allows users to choose different threading approaches for their applications.
  • Scalable MPI Analysis – Fast & Lightweight Analysis for 32K+ Ranks
    Intel® Trace Analyzer and Collector 9.1 Beta adds a new MPI Performance Snapshot feature for easy-to-use, scalable MPI statistics collection and analysis of large MPI jobs to identify areas for improvement.
  • Enhanced OpenMP* analysis and MPI+OpenMP multi-rank analysis
    Intel® VTune™ Amplifier 2016 Beta adds OpenMP parallelization inefficiency, imbalance, and work-sharing analysis to tune for more efficient use of parallel regions. It also now supports multi-rank analysis of MPI compute nodes with or without OpenMP use.

If you are ready to get started, follow this link to complete the pre-beta survey, register, and download the beta software:

Intel® Parallel Studio XE 2016 Pre-Beta Survey

For more details and information about this beta program, check out the Intel® Parallel Studio XE 2016 Beta page, which includes additional information in the FAQ and What’s New documents.

As a Beta tester, you’ll be expected to provide feedback to our development teams via Beta surveys and submissions at Intel® Premier Customer Support.

Construction


Caution

These operations must not be invoked concurrently on the same vector.

The following table provides additional information on the members of this template class.
Member / Description
concurrent_vector( const allocator_type& a = allocator_type() )

Constructs an empty vector using optionally specified allocator instance.

concurrent_vector( size_type n, const_reference t=T(), const allocator_type& a = allocator_type() );

Constructs a vector of n copies of t, using optionally specified allocator instance. If t is not specified, each element is default-constructed instead of copied.

template<typename InputIterator> concurrent_vector( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )

Constructs a vector that is a copy of the sequence [first,last), making only N calls to the copy constructor of T, where N is the distance between first and last.

concurrent_vector( std::initializer_list<T> il, const allocator_type &a = allocator_type() )

C++11 specific. Equivalent to concurrent_vector( il.begin(), il.end(), a) .

concurrent_vector( const concurrent_vector& src )

Constructs a copy of src.

concurrent_vector(concurrent_vector&& src )

C++11 specific; Constructs a new vector by moving content from src. src is left in an unspecified state, but can be safely destroyed.

concurrent_vector( concurrent_vector&& src, const allocator_type& a )

C++11 specific; Constructs a new vector by moving content from src using allocator a. src is left in an unspecified state, but can be safely destroyed.

concurrent_vector& operator=( const concurrent_vector& src )

Assigns the contents of src to *this.

Returns: a reference to *this.

concurrent_vector& operator=(concurrent_vector&& src);

C++11 specific; Moves data from src to *this. src is left in an unspecified state, but can be safely destroyed.

Returns: a reference to *this.

template<typename M> concurrent_vector& operator=( const concurrent_vector<T, M>& src )

Assigns the contents of src to *this.

Returns: a reference to *this.

concurrent_vector& operator=( std::initializer_list<T> il )

C++11 specific. Sets *this to contain data from il.

Returns: a reference to *this.

void assign( size_type n, const_reference t )

Assigns n copies of t to *this.

template<class InputIterator> void assign( InputIterator first, InputIterator last )

Assigns the contents of the sequence [first,last), making only N calls to the copy constructor of T, where N is the distance between first and last.

void assign( std::initializer_list<T> il )

C++11 specific. Equivalent to assign(il.begin(), il.end()).
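
As a brief illustrative sketch (not part of the reference material above), the snippet below exercises a few of the constructors and assign overloads listed in the table:

#include "tbb/concurrent_vector.h"

void construction_examples() {
    tbb::concurrent_vector<float> v1;                  // empty vector
    tbb::concurrent_vector<float> v2( 5, 1.0f );       // five copies of 1.0f
    float raw[] = { 1.0f, 2.0f, 3.0f };
    tbb::concurrent_vector<float> v3( raw, raw+3 );    // copy of [first,last)
    tbb::concurrent_vector<float> v4( v3 );            // copy of v3
    v1.assign( 4, 2.0f );                              // v1 now holds four copies of 2.0f
}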


Parallel Iteration


Types const_range_type and range_type model the Container Range concept. The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.

The following table provides additional information on the members of this template class.
Member / Description
const_range_type range() const

Returns: const_range_type object representing all keys in the table.

range_type range()

Returns: range_type object representing all keys in the table.


Construct

The following tables provide information on the members of the concurrent_unordered_map and concurrent_unordered_multimap template classes.

concurrent_unordered_map

Member / Description
explicit concurrent_unordered_map(size_type n = <implementation-defined>, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

Constructs a table with n buckets.

template <typename InputIterator> concurrent_unordered_map (InputIterator first, InputIterator last, size_type n = <implementation-defined>, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

Constructs a table with n buckets initialized with value_type(*i) where i is in the half open interval [first,last).

concurrent_unordered_map(const concurrent_unordered_map& m)

Constructs a copy of the map m.

concurrent_unordered_map(const Alloc& a)

Constructs an empty map using allocator a.

concurrent_unordered_map(const concurrent_unordered_map& m, const Alloc& a)

Constructs a copy of the map m using allocator a.

concurrent_unordered_map(std::initializer_list<value_type> il, size_type n = <implementation-defined>, const Hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

C++11 specific; Equivalent to concurrent_unordered_map(il.begin(), il.end(), n, hf, eql, a) .

concurrent_unordered_map(concurrent_unordered_map&& m)

C++11 specific; Constructs a new table by moving content from m. m is left in an unspecified state, but can be safely destroyed.

concurrent_unordered_map(concurrent_unordered_map&& m, const Alloc& a)

C++11 specific; Constructs a new table by moving content from m using the specified allocator. m is left in an unspecified state, but can be safely destroyed.

~concurrent_unordered_map()

Destroys the map.

concurrent_unordered_map& operator=(const concurrent_unordered_map& m);

Assigns contents of m to *this.

Returns: a reference to *this.

concurrent_unordered_map& operator=(concurrent_unordered_map&& m);

C++11 specific; Moves data from m to *this. m is left in an unspecified state, but can be safely destroyed.

Returns: a reference to *this.

concurrent_unordered_map& operator=(std::initializer_list<value_type> il);

C++11 specific; Assigns contents of il to *this.

Returns: a reference to *this.

allocator_type get_allocator() const;

Returns a copy of the allocator associated with *this.

concurrent_unordered_multimap

Member / Description
explicit concurrent_unordered_multimap(size_type n = <implementation-defined>, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

Constructs a table with n buckets.

template <typename InputIterator> concurrent_unordered_multimap(InputIterator first, InputIterator last, size_type n = <implementation-defined>, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

Constructs a table with n buckets initialized with value_type(*i) where i is in the half open interval [first,last).

concurrent_unordered_multimap(const concurrent_unordered_multimap& m)

Constructs a copy of the multimap m.

concurrent_unordered_multimap(const Alloc& a)

Constructs an empty multimap using allocator a.

concurrent_unordered_multimap(const concurrent_unordered_multimap& m, const Alloc& a)

Constructs a copy of the multimap m using allocator a.

concurrent_unordered_multimap(std::initializer_list<value_type> il, size_type n = <implementation-defined>, const Hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

C++11 specific; Equivalent to concurrent_unordered_multimap(il.begin(), il.end(), n, hf, eql, a) .

concurrent_unordered_multimap(concurrent_unordered_multimap&& m)

C++11 specific; Constructs a new table by moving content from m. m is left in an unspecified state, but can be safely destroyed.

concurrent_unordered_multimap(concurrent_unordered_multimap&& m, const Alloc& a)

C++11 specific; Constructs a new table by moving content from m using the specified allocator. m is left in an unspecified state, but can be safely destroyed.

~concurrent_unordered_multimap()

Destroys the multimap.

concurrent_unordered_multimap& operator=(const concurrent_unordered_multimap& m);

Assigns contents of m to *this.

Returns: a reference to *this.

concurrent_unordered_multimap& operator=(concurrent_unordered_multimap&& m);

C++11 specific; Moves data from m to *this. m is left in an unspecified state, but can be safely destroyed.

Returns: a reference to *this.

concurrent_unordered_multimap& operator=( std::initializer_list<value_type> il);

C++11 specific; Assigns contents of il to *this.

Returns: a reference to *this.

allocator_type get_allocator() const;

Returns a copy of the allocator associated with *this.
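
As a brief illustrative sketch (the names below are hypothetical), the snippet exercises a few of the constructors listed in the tables above:

#include "tbb/concurrent_unordered_map.h"
#include <string>
#include <utility>

void map_construction_examples() {
    typedef tbb::concurrent_unordered_map<std::string,int> Map;
    Map m1;                                   // implementation-defined bucket count
    Map m2( 1024 );                           // table with 1024 buckets
    std::pair<std::string,int> init[] = { std::make_pair(std::string("one"),1),
                                          std::make_pair(std::string("two"),2) };
    Map m3( init, init+2 );                   // constructed from [first,last)
    Map m4( m3 );                             // copy of m3
}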


parallel_scan Template Function


Summary

Template function that computes parallel prefix.

Header

#include "tbb/parallel_scan.h"

Syntax

template<typename Range, typename Body>
void parallel_scan( const Range& range, Body& body );

template<typename Range, typename Body>
void parallel_scan( const Range& range, Body& body, const auto_partitioner& );

template<typename Range, typename Body>
void parallel_scan( const Range& range, Body& body, const simple_partitioner& );

Description

A parallel_scan(range,body) computes a parallel prefix, also known as parallel scan. This computation is an advanced concept in parallel computing that is sometimes useful in scenarios that appear to have inherently serial dependences.

A mathematical definition of the parallel prefix is as follows. Let × be an associative operation with left-identity element id×. The parallel prefix of × over a sequence z0, z1, ... zn-1 is a sequence y0, y1, y2, ... yn-1 where:

  • y0 = id× × z0
  • yi = yi-1 × zi

For example, if × is addition, the parallel prefix corresponds to a running sum. A serial implementation of parallel prefix is:

T temp = id×;
for( int i=0; i<n; ++i ) {
    temp = temp × z[i];
    y[i] = temp;
}

Parallel prefix performs this in parallel by reassociating the application of × and using two passes. It may invoke × up to twice as many times as the serial prefix algorithm. Given the right grain size and sufficient hardware threads, it can outperform the serial prefix because even though it does more work, it can distribute the work across more than one hardware thread.

Tip

Because parallel_scan needs two passes, systems with only two hardware threads tend to exhibit small speedup. parallel_scan is best considered a glimpse of a technique for future systems with more than two cores. It is nonetheless of interest because it shows how a problem that appears inherently sequential can be parallelized.

The template parallel_scan<Range,Body> implements parallel prefix generically. It requires the signatures described in the table below.

parallel_scan Requirements

Pseudo-Signature

Semantics

void Body::operator()( const Range& r, pre_scan_tag )

Accumulate summary for range r .

void Body::operator()( const Range& r, final_scan_tag )

Compute scan result and summary for range r.

Body::Body( Body& b, split )

Split b so that this and b can accumulate summaries separately. Body *this is object a in the table row below.

void Body::reverse_join( Body& a )

Merge summary accumulated by a into summary accumulated by this, where this was created earlier from a by a's splitting constructor. Body *this is object b in the table row above.

void Body::assign( Body& b )

Assign summary of b to this.

A summary contains enough information such that for two consecutive subranges r and s:

  • If r has no preceding subrange, the scan result for s can be computed from knowing s and the summary for r.
  • A summary of r concatenated with s can be computed from the summaries of r and s.

For example, if computing a running sum of an array, the summary for a range r is the sum of the array elements corresponding to r.

The figure below shows one way that parallel_scan might compute the running sum of an array containing the integers 1-16. Time flows downwards in the diagram. Each color denotes a separate Body object. Summaries are shown in brackets.

  1. The first two steps split the original blue body into the pink and yellow bodies. Each body operates on a quarter of the input array in parallel. The last quarter is processed later in step 5.
  2. The blue body computes the final scan and summary for 1-4. The pink and yellow bodies compute their summaries by prescanning 5-8 and 9-12 respectively.
  3. The pink body computes its summary for 1-8 by performing a reverse_join with the blue body.
  4. The yellow body computes its summary for 1-12 by performing a reverse_join with the pink body.
  5. The blue, pink, and yellow bodies compute final scans and summaries for portions of the array.
  6. The yellow summary is assigned to the blue body. The pink and yellow bodies are destroyed.

Note that two quarters of the array were not prescanned. The parallel_scan template makes an effort to avoid prescanning where possible, to improve performance when there are only a few or no extra worker threads. If no other workers are available, parallel_scan processes the subranges without any pre_scans, by processing the subranges from left to right using final scans. That's why final scans must compute a summary as well as the final scan result. The summary might be needed to process the next subrange if no worker thread has prescanned it yet.

Example Execution of parallel_scan

The following code demonstrates how the signatures could be implemented to use parallel_scan to compute the same result as the earlier sequential example involving ×.

using namespace tbb;

class Body {
    T sum;
    T* const y;
    const T* const z;
public:
    Body( T y_[], const T z_[] ) : sum(id×), z(z_), y(y_) {}
    T get_sum() const {return sum;}

    template<typename Tag>
    void operator()( const blocked_range<int>& r, Tag ) {
        T temp = sum;
        for( int i=r.begin(); i<r.end(); ++i ) {
            temp = temp × z[i];
            if( Tag::is_final_scan() )
                y[i] = temp;
        }
        sum = temp;
    }
    Body( Body& b, split ) : z(b.z), y(b.y), sum(id×) {}
    void reverse_join( Body& a ) { sum = a.sum × sum;}
    void assign( Body& b ) {sum = b.sum;}
};

float DoParallelScan( T y[], const T z[], int n ) {
    Body body(y,z);
    parallel_scan( blocked_range<int>(0,n), body );
    return body.get_sum();
}

The definition of operator() demonstrates typical patterns when using parallel_scan.

  • A single template defines both versions. Doing so is not required, but usually saves coding effort, because the two versions are usually similar. The library defines static method is_final_scan() to enable differentiation between the versions.
  • The prescan variant computes the × reduction, but does not update y. The prescan is used by parallel_scan to generate look-ahead partial reductions.
  • The final scan variant computes the × reduction and updates y.

The operation reverse_join is similar to the operation join used by parallel_reduce, except that the arguments are reversed. That is, this is the right argument of ×. Template function parallel_scan decides if and when to generate parallel work. It is thus crucial that × is associative and that the methods of Body faithfully represent it. Operations such as floating-point addition that are somewhat associative can be used, with the understanding that the results may be rounded differently depending upon the association used by parallel_scan. The reassociation may differ between runs even on the same machine. However, if there are no worker threads available, execution associates identically to the serial form shown at the beginning of this section.

If you change the example to use a simple_partitioner, be sure to provide a grainsize. The code below shows how to do this for a grainsize of 1000:

parallel_scan( blocked_range<int>(0,n,1000), body, simple_partitioner() );
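
For concreteness, the following sketch instantiates the generic Body above for a running sum, with T taken as float, × as ordinary addition, and id× as 0. The names SumBody and RunningSum are illustrative only:

#include "tbb/parallel_scan.h"
#include "tbb/blocked_range.h"

using namespace tbb;

class SumBody {
    float sum;
    float* const y;
    const float* const z;
public:
    SumBody( float y_[], const float z_[] ) : sum(0), y(y_), z(z_) {}
    float get_sum() const { return sum; }

    template<typename Tag>
    void operator()( const blocked_range<int>& r, Tag ) {
        float temp = sum;
        for( int i=r.begin(); i<r.end(); ++i ) {
            temp = temp + z[i];
            if( Tag::is_final_scan() )   // only the final scan writes the output
                y[i] = temp;
        }
        sum = temp;
    }
    SumBody( SumBody& b, split ) : sum(0), y(b.y), z(b.z) {}
    void reverse_join( SumBody& a ) { sum = a.sum + sum; }
    void assign( SumBody& b ) { sum = b.sum; }
};

float RunningSum( float y[], const float z[], int n ) {
    SumBody body(y,z);
    parallel_scan( blocked_range<int>(0,n), body );
    return body.get_sum();   // total of z[0..n-1]
}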

Feature Macros


Macros in this section control optional features in the library.

TBB_DEPRECATED macro

The macro TBB_DEPRECATED controls deprecated features that would otherwise conflict with non-deprecated use. Define it to be 1 to get deprecated Intel® Threading Building Blocks (Intel® TBB) 2.1 interfaces.

TBB_USE_EXCEPTIONS macro

The macro TBB_USE_EXCEPTIONS controls whether the library headers use exception-handling constructs such as try, catch, and throw. The headers do not use these constructs when TBB_USE_EXCEPTIONS=0.

For the Microsoft Windows*, Linux*, and OS X* operating systems, the default value is 1 if exception handling constructs are enabled in the compiler, and 0 otherwise.

Caution

The runtime library may still throw an exception when TBB_USE_EXCEPTIONS=0.
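
A minimal sketch of turning the constructs off: the macro must be visible to the Intel® TBB headers, so define it before any of them is included (or pass it on the compiler command line):

// Define before including any Intel(R) TBB header so the headers see it.
#define TBB_USE_EXCEPTIONS 0
#include "tbb/tbb.h"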

TBB_USE_CAPTURED_EXCEPTION macro

The macro TBB_USE_CAPTURED_EXCEPTION controls rethrow of exceptions within the library. Because C++ 1998 does not support catching an exception on one thread and rethrowing it on another thread, the library sometimes resorts to rethrowing an approximation called tbb::captured_exception.

  • Define TBB_USE_CAPTURED_EXCEPTION=1 to make the library rethrow an approximation. This is useful for uniform behavior across platforms.

  • Define TBB_USE_CAPTURED_EXCEPTION=0 to request rethrow of the exact exception. This setting is valid only on platforms that support the std::exception_ptr feature of C++11. Otherwise a compile-time diagnostic is issued.

On Windows* , Linux* and OS X* operating systems, the default value is 1 for supported host compilers with std::exception_ptr, and 0 otherwise. On IA-64 architecture processors the default value is 0.

Caution

In order for exact exception propagation to work properly, an appropriate library binary should be used.

C++11 Support

To enable C++11 specific code, you need to use a compiler that supports C++11 mode, and compile your code with the C++11 mode set. C++11 support is off by default in the compiler. The following table shows the option for turning it on.

Compilation Commands for Setting C++11 Support (Intel® C++ Compiler, Version 11.0)

Environment / Compilation Command and Option

Windows* OS systems: icl /Qstd:c++0x foo.cpp

Linux* OS systems and OS X* systems: icc -std=c++0x foo.cpp


Lazy Initialization


Problem

Perform an initialization the first time it is needed.

Context

Initializing data structures lazily is a common technique. Not only does it avoid the cost of initializing unused data structures, it is often a more convenient way to structure a program.

Forces

  • Threads share access to an object.

  • The object should not be created until the first access.

The second force covers several possible motivations:

  • The object is expensive to create and creating it early would slow down program startup.

  • It is not used in every run of the program.

  • Early initialization would require adding code where it is undesirable for readability or structural reasons.

Related

  • Fenced Data Transfer

Solutions

A parallel solution is substantially trickier, because it must deal with several concurrency issues.

  • Races: If two threads attempt to access the object simultaneously for the first time, and thus cause creation of the object, the race must be resolved in a way that both threads end up with a reference to the same object of type T.

  • Memory leaks: In the event of a race, the implementation must ensure that any extra transient T objects are cleaned up.

  • Memory consistency: If thread X executes value=new T(), all other threads must see the stores performed by new T() occur before the assignment to value.

  • Deadlock: What if the constructor of T() requires acquiring a lock, but the current holder of that lock is also racing to access the object for the first time?

There are two solutions. One is based on double-check locking. The other relies on compare-and-swap. Because the tradeoffs and issues are subtle, most of the discussion is in the following examples section.

Examples

An Intel® Threading Building Blocks (Intel® TBB) implementation of the "double-check" pattern is shown below.

template<typename T, typename Mutex=tbb::mutex>
class lazy {
   tbb::atomic<T*> value;
   Mutex mut;
public:
   lazy() : value() {}                    // Initializes value to NULL
   ~lazy() {delete value;}
   T& get() {
       if( !value ) {                     // Read of value has acquire semantics.
           typename Mutex::scoped_lock lock(mut);
           if( !value ) value = new T();  // Write of value has release semantics.
       }
       return *value;
   }
};

The name comes from the way that the pattern deals with races. There is one check done without locking and one check done after locking. The first check handles the presumably common case that the initialization has already been done, without any locking. The second check deals with cases where two threads both see an uninitialized value, and both try to acquire the lock. In that case, the second thread to acquire the lock will see that the initialization has already occurred.

If T() throws an exception, the solution is correct because value will still be NULL and the mutex unlocked when object lock is destroyed.

The solution correctly addresses memory consistency issues. A write to a tbb::atomic value has release semantics, which means that all of its prior writes will be seen before the releasing write. A read from tbb::atomic value has acquire semantics, which means that all of its subsequent reads will happen after the acquiring read. Both of these properties are critical to the solution. The releasing write ensures that the construction of T() is seen to occur before the assignment to value. The acquiring read ensures that when the caller reads from *value, the reads occur after the "if(!value)" check. The release/acquire is essentially the Fenced Data Transfer pattern, where the "message" is the fully constructed instance T(), and the "ready" flag is the pointer value.

The solution described involves blocking threads while initialization occurs. Hence it can suffer the usual pathologies associated with blocking. For example, if the thread that first acquires the lock is suspended by the OS, all other threads will have to wait until that thread resumes. A lock-free variation avoids this problem by making all contending threads attempt initialization, and atomically deciding which attempt succeeds.

An Intel® TBB implementation of the non-blocking variant follows. It also uses double-check, but without a lock.

template<typename T>
class lazy {
   tbb::atomic<T*> value;
public:
   lazy() : value() {}                    // Initializes value to NULL
   ~lazy() {delete value;}
   T& get() {
       if( !value ) {
           T* tmp = new T();
           if( value.compare_and_swap(tmp,NULL)!=NULL )
               // Another thread installed the value, so throw away mine.
               delete tmp;
       }
       return *value;
   }
};

The second check is performed by the expression value.compare_and_swap(tmp,NULL)!=NULL, which conditionally assigns value=tmp if value==NULL, and evaluates to true if the old value was not NULL. Thus if multiple threads attempt simultaneous initialization, the first thread to execute the compare_and_swap will set value to point to its T object. Other contenders that execute the compare_and_swap will get back a non-NULL pointer, and know that they should delete their transient T objects.

As with the locking solution, memory consistency issues are addressed by the semantics of tbb::atomic. The first check has acquire semantics and the compare_and_swap has both acquire and release semantics.
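
For illustration, either variant of the lazy wrapper above could be used as follows; Widget and its expensive constructor are hypothetical:

// Hypothetical type whose construction is worth deferring.
struct Widget {
    Widget() { /* expensive setup */ }
    void work() {}
};

static lazy<Widget> cached_widget;     // shared by many threads

void use_widget() {
    // The first thread to call get() constructs the Widget; later calls reuse it.
    cached_widget.get().work();
}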

References

Lawrence Crowl, "Dynamic Initialization and Destruction with Concurrency", http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2660.htm


Scheduler Bypass


Scheduler bypass is an optimization where you directly specify the next task to run. Continuation-passing style often opens up an opportunity for scheduler bypass. For example, at the end of the continuation-passing example in the previous section, method execute() spawns task "a" and returns. By the execution rules in How Task Scheduling Works, that sequence causes the executing thread to do the following:

  1. Push task "a" onto the thread's deque.

  2. Return from method execute().

  3. Pop task "a" from the thread's deque, unless it is stolen by another thread.

Steps 1 and 3 introduce unnecessary deque operations, or worse yet, permit stealing that can hurt locality without adding significant parallelism. Method execute() can avoid these problems by returning a pointer to task a instead of spawning it. When using the method shown in How Task Scheduling Works, a becomes the next task executed by the thread. Furthermore, this approach guarantees that the thread executes a, not some other thread.

The following example shows the changes to the example in the previous section; the modified lines are marked with comments:

struct FibTask: public task {
    ...
    task* execute() {
        if( n<CutOff ) {
            *sum = SerialFib(n);
            return NULL;
        } else {
            FibContinuation& c =
                *new( allocate_continuation() ) FibContinuation(sum);

            FibTask& a = *new( c.allocate_child() ) FibTask(n-2,&c.x);
            FibTask& b = *new( c.allocate_child() ) FibTask(n-1,&c.y);
            // Set ref_count to "two children".
            c.set_ref_count(2);
            spawn( b );
            // spawn( a );     This line removed.
            // return NULL;    This line removed.
            return &a;
        }
    }
};

Memory Consistency


Some architectures, such as IA-64 architecture, have "weak memory consistency", in which memory operations on different addresses may be reordered by the hardware for sake of efficiency. The subject is complex, and it is recommended that the interested reader consult other works (Intel 2002, Robison 2003) on the subject. If you are programming only for IA-32 and Intel® 64 architecture platforms, you can skip this section.

Class atomic<T> permits you to enforce certain ordering of memory operations as described in the following table.

Ordering Constraints

Kind: acquire
Description: Operations after the atomic operation never move over it.
Default for: read

Kind: release
Description: Operations before the atomic operation never move over it.
Default for: write

Kind: sequentially consistent
Description: Operations on either side never move over the atomic operation, and the sequentially consistent atomic operations have a global order.
Default for: fetch_and_store, fetch_and_add, compare_and_swap

The rightmost column lists the operations that default to a particular constraint. Use these defaults to avoid unexpected surprises. For read and write, the defaults are the only constraints available. However, if you are familiar with weak memory consistency, you might want to change the default sequential consistency for the other operations to weaker constraints. To do this, use variants that take a template argument. The argument can be acquire or release, which are values of the enum type memory_semantics.

For example, suppose various threads are producing parts of a data structure, and you want to signal a consuming thread when the data structure is ready. One way to do this is to initialize an atomic counter with the number of busy producers, and as each producer finishes, it executes:

refcount.fetch_and_add<release>(-1);

The argument release guarantees that the producer's writes to shared memory occur before refcount is decremented. Similarly, when the consumer checks refcount, the consumer must use an acquire fence, which is the default for reads, so that the consumer's reads of the data structure do not happen until after the consumer sees refcount become 0.
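
A minimal sketch of that producer/consumer arrangement is shown below. The names data, NumProducers, and produce_part are illustrative only:

#include "tbb/atomic.h"

const int NumProducers = 4;
tbb::atomic<int> refcount;                         // producers still working
int data[NumProducers];                            // structure being filled in

static int produce_part( int i ) { return i*i; }   // stand-in for real work

void start() { refcount = NumProducers; }

void producer( int i ) {
    data[i] = produce_part(i);                     // writes to shared memory...
    refcount.fetch_and_add<tbb::release>(-1);      // ...complete before the decrement is seen
}

void consumer() {
    while( refcount!=0 ) {}    // the read of refcount has acquire semantics by default
    // All of data[] is now safe to read.
}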


Containers


Intel® Threading Building Blocks (Intel® TBB) provides highly concurrent container classes. These containers can be used with raw Windows* OS or Linux* OS threads, or in conjunction with task-based programming.

A concurrent container allows multiple threads to concurrently access and update items in the container. Typical C++ STL containers do not permit concurrent update; attempts to modify them concurrently often result in corrupting the container. STL containers can be wrapped in a mutex to make them safe for concurrent access, by letting only one thread operate on the container at a time, but that approach eliminates concurrency, thus restricting parallel speedup.

Containers provided by Intel® TBB offer a much higher level of concurrency, via one or both of the following methods:

  • Fine-grained locking: Multiple threads operate on the container by locking only those portions they really need to lock. As long as different threads access different portions, they can proceed concurrently.

  • Lock-free techniques: Different threads account and correct for the effects of other interfering threads.

Notice that highly-concurrent containers come at a cost. They typically have higher overheads than regular STL containers. Operations on highly-concurrent containers may take longer than for STL containers. Therefore, use highly-concurrent containers when the speedup from the additional concurrency that they enable outweighs their slower sequential performance.

Caution

As with most objects in C++, the constructor or destructor of a container object must not be invoked concurrently with another operation on the same object. Otherwise the resulting race may cause the operation to be executed on an undefined object.




Mapping Nodes to Tasks


The following figure shows the timeline for one possible execution of the two node graph example in the previous section. The bodies of n and m will be referred to as λn and λm, respectively. The three calls to try_put spawn three tasks; each one applies the lambda expression, λn, on one of the three input messages. Because n has unlimited concurrency, these tasks can execute concurrently if there are enough threads available. The call to g.wait_for_all() blocks until there are no tasks executing in the graph. As with other wait_for_all functions in Intel TBB, the thread that calls wait_for_all is not spinning idly during this time, but instead can join in executing other tasks from the work pool.

Execution Timeline of a Two Node Graph

As each task from n finishes, it puts its output to m, since m is a successor of n. Unlike node n, m has been constructed with a concurrency limit of 1 and therefore does not spawn all tasks immediately. Instead, it sequentially spawns tasks to execute its body, λm, on the messages in the order that they arrive. When all tasks are complete, the call to wait_for_all returns.

Note

All execution in the flow graph happens asynchronously. The calls to try_put return control to the calling thread quickly, after either immediately spawning a task or buffering the message being passed. Likewise, the body tasks execute the lambda expressions and then put the result to any successor nodes. Only the call to wait_for_all blocks, as it should, and even in this case the calling thread may be used to execute tasks from the Intel TBB work pool while it is waiting.

The above timeline shows the sequence when there are enough threads to execute all of the tasks that can be executed in parallel. If there are fewer threads, some spawned tasks will need to wait until a thread is available to execute them.


parallel_reduce


A loop can do a reduction, as in this summation:

float SerialSumFoo( float a[], size_t n ) {
    float sum = 0;
    for( size_t i=0; i!=n; ++i )
        sum += Foo(a[i]);
    return sum;
}

If the iterations are independent, you can parallelize this loop using the template class parallel_reduce as follows:

float ParallelSumFoo( const float a[], size_t n ) {
    SumFoo sf(a);
    parallel_reduce( blocked_range<size_t>(0,n), sf );
    return sf.my_sum;
}

The class SumFoo specifies details of the reduction, such as how to accumulate subsums and combine them. Here is the definition of class SumFoo:

class SumFoo {
    float* my_a;
public:
    float my_sum;
    void operator()( const blocked_range<size_t>& r ) {
        float *a = my_a;
        float sum = my_sum;
        size_t end = r.end();
        for( size_t i=r.begin(); i!=end; ++i )
            sum += Foo(a[i]);
        my_sum = sum;
    }

    SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {}

    void join( const SumFoo& y ) {my_sum+=y.my_sum;}

    SumFoo(float a[] ) :
        my_a(a), my_sum(0)
    {}
};

Note the differences with class ApplyFoo from parallel_for. First, operator() is not const. This is because it must update SumFoo::my_sum. Second, SumFoo has a splitting constructor and a method join that must be present for parallel_reduce to work. The splitting constructor takes as arguments a reference to the original object, and a dummy argument of type split, which is defined by the library. The dummy argument distinguishes the splitting constructor from a copy constructor.

Tip

In the example, the definition of operator() uses local temporary variables (a, sum, end) for scalar values accessed inside the loop. This technique can improve performance by making it obvious to the compiler that the values can be held in registers instead of memory. If the values are too large to fit in registers, or have their address taken in a way the compiler cannot track, the technique might not help. With a typical optimizing compiler, using local temporaries for only written variables (such as sum in the example) can suffice, because then the compiler can deduce that the loop does not write to any of the other locations, and hoist the other reads to outside the loop.

When a worker thread is available, as decided by the task scheduler, parallel_reduce invokes the splitting constructor to create a subtask for the worker. When the subtask completes, parallel_reduce uses method join to accumulate the result of the subtask. The graph at the top of the following figure shows the split-join sequence that happens when a worker is available:

Graph of the Split-join Sequence

An arc in the above figure indicates order in time. The splitting constructor might run concurrently while object x is being used for the first half of the reduction. Therefore, all actions of the splitting constructor that creates y must be made thread safe with respect to x. So if the splitting constructor needs to increment a reference count shared with other objects, it should use an atomic increment.

If a worker is not available, the second half of the iteration is reduced using the same body object that reduced the first half. That is, the reduction of the second half starts where the reduction of the first half finished.

Caution

Because split/join are not used if workers are unavailable, parallel_reduce does not necessarily do recursive splitting.

Caution

Because the same body might be used to accumulate multiple subranges, it is critical that operator() not discard earlier accumulations. The code below shows an incorrect definition of SumFoo::operator().

class SumFoo {
    ...
public:
    float my_sum;
    void operator()( const blocked_range<size_t>& r ) {
        ...
        float sum = 0;  // WRONG – should be "float sum = my_sum;"
        ...
        for( ... )
            sum += Foo(a[i]);
        my_sum = sum;
    }
    ...
};

With the mistake, the body returns a partial sum for the last subrange instead of all subranges to which parallel_reduce applies it.

The rules for partitioners and grain sizes for parallel_reduce are the same as for parallel_for.

parallel_reduce generalizes to any associative operation. In general, the splitting constructor does two things:

  • Copy read-only information necessary to run the loop body.

  • Initialize the reduction variable(s) to the identity element of the operation(s).

The join method should do the corresponding merge(s). You can do more than one reduction at the same time: you can gather the min and max with a single parallel_reduce.
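
As a sketch of gathering two reductions in one pass, a body in the same style as SumFoo could track both bounds; the name MinMaxFoo is illustrative only:

#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include <algorithm>
#include <cfloat>

class MinMaxFoo {
    const float* my_a;
public:
    float my_min, my_max;
    void operator()( const tbb::blocked_range<size_t>& r ) {
        for( size_t i=r.begin(); i!=r.end(); ++i ) {
            my_min = std::min( my_min, my_a[i] );
            my_max = std::max( my_max, my_a[i] );
        }
    }
    MinMaxFoo( MinMaxFoo& x, tbb::split ) :
        my_a(x.my_a), my_min(FLT_MAX), my_max(-FLT_MAX) {}   // identity elements
    void join( const MinMaxFoo& y ) {
        my_min = std::min( my_min, y.my_min );
        my_max = std::max( my_max, y.my_max );
    }
    MinMaxFoo( const float a[] ) : my_a(a), my_min(FLT_MAX), my_max(-FLT_MAX) {}
};

void ParallelMinMax( const float a[], size_t n, float& lo, float& hi ) {
    MinMaxFoo mm(a);
    tbb::parallel_reduce( tbb::blocked_range<size_t>(0,n), mm );
    lo = mm.my_min;
    hi = mm.my_max;
}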

Note

The reduction operation can be non-commutative. The example still works if floating-point addition is replaced by string concatenation.


Debug Versus Release Libraries


The following table details the Intel® Threading Building Blocks (Intel® TBB) dynamic shared libraries that come in debug and release versions.

Dynamic Shared Libraries Included in Intel® Threading Building Blocks

Library (*.dll, lib*.so, or lib*.dylib): tbb_debug, tbbmalloc_debug, tbbmalloc_proxy_debug

Description: These versions have extensive internal checking for correct use of the library.

When to Use: Use with code that is compiled with the macro TBB_USE_DEBUG set to 1.

Library (*.dll, lib*.so, or lib*.dylib): tbb, tbbmalloc, tbbmalloc_proxy

Description: These versions deliver top performance. They eliminate most checking for correct use of the library.

When to Use: Use with code compiled with TBB_USE_DEBUG undefined or set to zero.

Tip

Test your programs with the debug versions of the libraries first, to ensure that you are using the library correctly. With the release versions, incorrect usage may result in unpredictable program behavior.

Intel® TBB supports Intel® Parallel Inspector, Intel® Inspector XE, Intel® Parallel Amplifier, and Intel® VTune™ Amplifier XE. Full support of these tools requires compiling with macro TBB_USE_THREADING_TOOLS=1. That symbol defaults to 1 in the following conditions:

  • When TBB_USE_DEBUG=1.

  • On the Microsoft Windows* operating system, when _DEBUG=1.

The Intel® Threading Building Blocks Reference section explains the default values in more detail.

Caution

The instrumentation support for Intel® Parallel Inspector and Intel® Inspector XE becomes live after the first initialization of the task library. If the library components are used before this initialization occurs, Intel® Parallel Inspector and Intel® Inspector XE may falsely report race conditions that are not really races.


Iterators


Template class concurrent_hash_map supports forward iterators; that is, iterators that can advance only forwards across a table. Reverse iterators are not supported. Concurrent operations (count, find, insert, and erase) invalidate any existing iterators that point into the table. An exception to this rule is that count and find do not invalidate iterators if no insertions or erasures have occurred after the most recent call to method rehash.

Note

Do not call concurrent operations, including count and find, while iterating the table. Use concurrent_unordered_map if concurrent traversal and insertion are required.

The following table provides additional information on the members of this template class.
Member / Description
iterator begin()

Returns: iterator pointing to beginning of key-value sequence.

iterator end()

Returns: iterator pointing to end of key-value sequence.

const_iterator begin() const

Returns: const_iterator pointing to the beginning of key-value sequence.

const_iterator end() const

Returns: const_iterator pointing to the end of key-value sequence.

std::pair<iterator, iterator> equal_range( const Key& key );

Returns: Pair of iterators (i,j) such that the half-open range [i,j) contains all pairs in the map (and only such pairs) with keys equal to key. Because the map has no duplicate keys, the half-open range is either empty or contains a single pair.

Tip

This method is a serial alternative to the concurrent count and find methods.

std::pair<const_iterator, const_iterator> equal_range( const Key& key ) const;

See std::pair<iterator, iterator> equal_range( const Key& key ).
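
For illustration, a serial traversal might look like the sketch below; the table type and dump function are hypothetical, and no concurrent operations may run while it iterates:

#include "tbb/concurrent_hash_map.h"
#include <string>
#include <iostream>

typedef tbb::concurrent_hash_map<std::string,int> StringTable;

void dump( const StringTable& table ) {
    // Safe only while no other thread calls count, find, insert, or erase.
    for( StringTable::const_iterator i=table.begin(); i!=table.end(); ++i )
        std::cout << i->first << " " << i->second << "\n";
}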


concurrent_unordered_set and concurrent_unordered_multiset Template Classes


Summary

Template classes for set containers that support concurrent insertion and traversal.

Syntax

template <typename Key,
          typename Hasher = tbb_hash<Key>,
          typename Equality = std::equal_to<Key>,
          typename Allocator = tbb::tbb_allocator<Key> >
class concurrent_unordered_set;

template <typename Key,
          typename Hasher = tbb_hash<Key>,
          typename Equality = std::equal_to<Key>,
          typename Allocator = tbb::tbb_allocator<Key> >
class concurrent_unordered_multiset;

Header

#include "tbb/concurrent_unordered_set.h"

Description

concurrent_unordered_set and concurrent_unordered_multiset support concurrent insertion and traversal, but not concurrent erasure. The interfaces have no visible locking. They may hold locks internally, but never while calling user-defined code. They have semantics similar to the C++11 std::unordered_set and std::unordered_multiset respectively except as follows:

  • Some methods requiring C++11 language features (e.g. rvalue references) are omitted.

  • The erase methods are prefixed with unsafe_, to indicate that they are not concurrency safe.

  • Bucket methods are prefixed with unsafe_ as a reminder that they are not concurrency safe with respect to insertion.

  • For concurrent_unordered_set, insert methods may create a temporary item that is destroyed if another thread inserts the same item concurrently.

  • Like std::list, insertion of new items does not invalidate any iterators, nor change the order of items already in the set. Insertion and traversal may be concurrent.

  • The iterator types iterator and const_iterator are of the forward iterator category.

  • Insertion does not invalidate or update the iterators returned by equal_range, so insertion may cause non-equal items to be inserted at the end of the range. However, the first iterator will nonetheless point to the found item even after an insertion operation.

Class / Key Difference

concurrent_unordered_set

An item may be inserted in concurrent_unordered_set only once.

concurrent_unordered_multiset

  • An item may be inserted in concurrent_unordered_multiset more than once.

  • find will return the first item in the table with a matching search key, though concurrent accesses to the container may insert other occurrences of the same item before the one returned.

Caution

As with any form of hash table, keys that are equal must have the same hash code, and the ideal hash function distributes keys uniformly across the hash code space.
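
For illustration, the sketch below inserts into one set from many threads and is safe even if traversal overlaps the insertions; the names seen and RecordValues are hypothetical:

#include "tbb/concurrent_unordered_set.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

tbb::concurrent_unordered_set<int> seen;

void RecordValues( const int values[], size_t n ) {
    tbb::parallel_for( tbb::blocked_range<size_t>(0,n),
        [&]( const tbb::blocked_range<size_t>& r ) {
            for( size_t i=r.begin(); i!=r.end(); ++i )
                seen.insert( values[i] );    // concurrent insertion is safe
        } );
    // Traversal may overlap insertion, but erasure (unsafe_erase) may not.
}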

Members of concurrent_unordered_set and concurrent_unordered_multiset

Many of the methods in the following synopsis may be concurrently invoked. For example, three different threads can concurrently call methods insert, begin, and size. Their results might be non-deterministic. For example, the result from size might correspond to before, or after the insertion.

public:
    // types
    typedef Key key_type;
    typedef Key value_type;
    typedef Key mapped_type;
    typedef Hash hasher;
    typedef Equality key_equal;
    typedef Alloc allocator_type;
    typedef typename allocator_type::pointer pointer;
    typedef typename allocator_type::const_pointer const_pointer;
    typedef typename allocator_type::reference reference;
    typedef typename allocator_type::const_reference const_reference;
    typedef implementation-defined size_type;
    typedef implementation-defined difference_type;
    typedef implementation-defined iterator;
    typedef implementation-defined const_iterator;
    typedef implementation-defined local_iterator;
    typedef implementation-defined const_local_iterator;

    allocator_type get_allocator() const;

    // size and capacity
    bool empty() const;     // May take linear time!
    size_type size() const; // May take linear time!
    size_type max_size() const;

    // iterators
    iterator begin();
    const_iterator begin() const;
    iterator end();
    const_iterator end() const;
    const_iterator cbegin() const;
    const_iterator cend() const;

    // modifiers
    std::pair<iterator, bool> insert(const value_type& x);
    iterator insert(const_iterator hint, const value_type& x);
    template<class InputIterator>
    void insert(InputIterator first, InputIterator last);
    // C++11 specific:
    std::pair<iterator, bool> insert(value_type&& x);
    iterator insert(const_iterator hint, value_type&& x);
    void insert(std::initializer_list<value_type> il);
    // C++11 specific:
    template<typename... Args>
    std::pair<iterator, bool> emplace(Args&&... args);
    template<typename... Args>
    iterator emplace_hint(const_iterator hint, Args&&... args);

    iterator unsafe_erase(const_iterator position);
    size_type unsafe_erase(const key_type& k);
    iterator unsafe_erase(const_iterator first, const_iterator last);
    void clear();

    // observers
    hasher hash_function() const;
    key_equal key_eq() const;

    // lookup
    iterator find(const key_type& k);
    const_iterator find(const key_type& k) const;
    size_type count(const key_type& k) const;
    std::pair<iterator, iterator> equal_range(const key_type& k);
    std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const;

    // parallel iteration
    typedef implementation-defined range_type;
    typedef implementation-defined const_range_type;
    range_type range();
    const_range_type range() const;

    // bucket interface - for debugging
    size_type unsafe_bucket_count() const;
    size_type unsafe_max_bucket_count() const;
    size_type unsafe_bucket_size(size_type n);
    size_type unsafe_bucket(const key_type& k) const;
    local_iterator unsafe_begin(size_type n);
    const_local_iterator unsafe_begin(size_type n) const;
    local_iterator unsafe_end(size_type n);
    const_local_iterator unsafe_end(size_type n) const;
    const_local_iterator unsafe_cbegin(size_type n) const;
    const_local_iterator unsafe_cend(size_type n) const;

    // hash policy
    float load_factor() const;
    float max_load_factor() const;
    void max_load_factor(float z);
    void rehash(size_type n);

Members of concurrent_unordered_set

public:
    // construct/destroy/copy
    explicit concurrent_unordered_set(size_type n = implementation-defined,
                                      const Hasher& hf = hasher(),
                                      const key_equal& eql = key_equal(),
                                      const allocator_type& a = allocator_type());
    template <typename InputIterator>
    concurrent_unordered_set(InputIterator first, InputIterator last,
                             size_type n = implementation-defined,
                             const hasher& hf = hasher(),
                             const key_equal& eql = key_equal(),
                             const allocator_type& a = allocator_type());
    concurrent_unordered_set(const concurrent_unordered_set&);
    concurrent_unordered_set(const Alloc&);
    concurrent_unordered_set(const concurrent_unordered_set&, const Alloc&);
    // C++11 specific:
    concurrent_unordered_set(concurrent_unordered_set&&);
    concurrent_unordered_set(concurrent_unordered_set&&, const Allocator&);
    concurrent_unordered_set(std::initializer_list<value_type> il,
                             size_type n = implementation-defined,
                             const Hasher& hf = hasher(),
                             const key_equal& eql = key_equal(),
                             const allocator_type& a = allocator_type());
    ~concurrent_unordered_set();

    concurrent_unordered_set& operator=(const concurrent_unordered_set&);
    // C++11 specific:
    concurrent_unordered_set& operator=(concurrent_unordered_set&&);
    concurrent_unordered_set& operator=(std::initializer_list<value_type> il);

    void swap(concurrent_unordered_set&);

Members of concurrent_unordered_multiset

public:
    // construct/destroy/copy
    explicit concurrent_unordered_multiset(size_type n = implementation-defined,
                                           const Hasher& hf = hasher(),
                                           const key_equal& eql = key_equal(),
                                           const allocator_type& a = allocator_type());
    template <typename InputIterator>
    concurrent_unordered_multiset(InputIterator first, InputIterator last,
                                  size_type n = implementation-defined,
                                  const hasher& hf = hasher(),
                                  const key_equal& eql = key_equal(),
                                  const allocator_type& a = allocator_type());
    concurrent_unordered_multiset(const concurrent_unordered_multiset&);
    concurrent_unordered_multiset(const Alloc&);
    concurrent_unordered_multiset(const concurrent_unordered_multiset&, const Alloc&);
    // C++11 specific:
    concurrent_unordered_multiset(concurrent_unordered_multiset&&);
    concurrent_unordered_multiset(concurrent_unordered_multiset&&, const Allocator&);
    concurrent_unordered_multiset(std::initializer_list<value_type> il,
                                  size_type n = implementation-defined,
                                  const Hasher& hf = hasher(),
                                  const key_equal& eql = key_equal(),
                                  const allocator_type& a = allocator_type());
    ~concurrent_unordered_multiset();

    concurrent_unordered_multiset& operator=(const concurrent_unordered_multiset&);
    // C++11 specific:
    concurrent_unordered_multiset& operator=(concurrent_unordered_multiset&&);
    concurrent_unordered_multiset& operator=(std::initializer_list<value_type> il);

    void swap(concurrent_unordered_multiset&);