
parallel_for Template Function


Summary

Template function that performs parallel iteration over a range of values.

Header

 #include "tbb/parallel_for.h"

Syntax

template<typename Index, typename Func>
Func parallel_for( Index first, Index last, const Func& f
                   [, partitioner[, task_group_context& group]] );

template<typename Index, typename Func>
Func parallel_for( Index first, Index last,
                   Index step, const Func& f
                   [, partitioner[, task_group_context& group]] );

template<typename Range, typename Body>
void parallel_for( const Range& range, const Body& body
                   [, partitioner[, task_group_context& group]] );

Description

A parallel_for(first,last,step,f) represents parallel execution of the loop:

for( auto i=first; i<last; i+=step ) f(i);

The index type must be an integral type. The loop must not wrap around. The step value must be positive. If omitted, it is implicitly 1. There is no guarantee that the iterations run in parallel. Deadlock may occur if a lesser iteration waits for a greater iteration. The partitioning strategy is auto_partitioner when the parameter is not specified.
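
For illustration, here is a minimal sketch of this compact integer form with a lambda expression; the array a and its length n are assumptions made for the example:

#include "tbb/parallel_for.h"

// Zero out a[0..n-1] in parallel; equivalent to
// for( int i=0; i<n; i+=1 ) a[i] = 0.0f;
void Zero( float* a, int n ) {
    tbb::parallel_for( 0, n, 1, [=]( int i ) {
        a[i] = 0.0f;
    } );
}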

A parallel_for(range,body,partitioner) provides a more general form of parallel iteration. It represents parallel execution of body over each value in range. The optional partitioner specifies a partitioning strategy. Type Range must model the Range concept. The body must model the requirements in the following table.

Requirements for parallel_for Body

Pseudo-Signature

Semantics

Body::Body( const Body& )

Copy constructor.

Body::~Body()

Destructor.

void Body::operator()( Range& range ) const

Apply body to range.

A parallel_for recursively splits the range into subranges to the point such that is_divisible() is false for each subrange, and makes copies of the body for each of these subranges. For each such body/subrange pair, it invokes Body::operator(). The invocations are interleaved with the recursive splitting, in order to minimize space overhead and efficiently use cache.

Some of the copies of the range and body may be destroyed after parallel_for returns. This late destruction is not an issue in typical usage, but is something to be aware of when looking at execution traces or writing range or body objects with complex side effects.

When worker threads are available, parallel_for executes iterations in non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.

When no worker threads are available, parallel_for executes iterations from left to right in the following sense. Imagine drawing a binary tree that represents the recursive splitting. Each non-leaf node represents splitting a subrange r by invoking one of the splitting constructors of Range. The left child represents the updated value of r. The right child represents the newly constructed object. Each leaf in the tree represents an indivisible subrange. The method Body::operator() is invoked on each leaf subrange, from left to right.

All overloads can be passed a task_group_context object so that the algorithm’s tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

Complexity

If the range and body take O(1) space, and the range splits into nearly equal pieces, then the space complexity is O(P log(N)), where N is the size of the range and P is the number of threads.

Example

This example defines a routine ParallelAverage that sets output[i] to the average of input[i-1], input[i], and input[i+1], for 1 <= i < n.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using namespace tbb;

struct Average {
    const float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end(); ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.f);
    }
};

// Note: Reads input[0..n] and writes output[1..n-1].
void ParallelAverage( float* output, const float* input, size_t n ) {
    Average avg;
    avg.input = input;
    avg.output = output;
    parallel_for( blocked_range<int>( 1, n ), avg );
}

Example

This example is more complex and requires familiarity with STL. It shows the power of parallel_for beyond flat iteration spaces. The code performs a parallel merge of two sorted sequences. It works for any sequence with a random-access iterator. The algorithm (Akl 1987) works recursively as follows:

  1. If the sequences are too short for effective use of parallelism, do a sequential merge. Otherwise perform steps 2-6.
  2. Swap the sequences if necessary, so that the first sequence [begin1,end1) is at least as long as the second sequence [begin2,end2).
  3. Set m1 to the middle position in [begin1,end1). Call the item at that location key.
  4. Set m2 to where key would fall in [begin2,end2).
  5. Merge [begin1,m1) and [begin2,m2) to create the first part of the merged sequence.
  6. Merge [m1,end1) and [m2,end2) to create the second part of the merged sequence.

The Intel® Threading Building Blocks implementation of this algorithm uses the range object to perform most of the steps. Predicate is_divisible performs the test in step 1. The splitting constructor performs steps 2-6. The body object does the sequential merges.

#include "tbb/parallel_for.h"
#include <algorithm>

using namespace tbb;

template<typename Iterator>
struct ParallelMergeRange {
    static size_t grainsize;
    Iterator begin1, end1; // [begin1,end1) is 1st sequence to be merged
    Iterator begin2, end2; // [begin2,end2) is 2nd sequence to be merged
    Iterator out;               // where to put merged sequence
    bool empty()   const {return (end1-begin1)+(end2-begin2)==0;}
    bool is_divisible() const {
        return std::min( end1-begin1, end2-begin2 ) > grainsize;
    }
    ParallelMergeRange( ParallelMergeRange& r, split ) {
        if( r.end1-r.begin1 < r.end2-r.begin2 ) {
            std::swap(r.begin1,r.begin2);
            std::swap(r.end1,r.end2);
        }
        Iterator m1 = r.begin1 + (r.end1-r.begin1)/2;
        Iterator m2 = std::lower_bound( r.begin2, r.end2, *m1 );
        begin1 = m1;
        begin2 = m2;
        end1 = r.end1;
        end2 = r.end2;
        out = r.out + (m1-r.begin1) + (m2-r.begin2);
        r.end1 = m1;
        r.end2 = m2;
    }
    ParallelMergeRange( Iterator begin1_, Iterator end1_,
                        Iterator begin2_, Iterator end2_,
                        Iterator out_ ) :
        begin1(begin1_), end1(end1_),
        begin2(begin2_), end2(end2_), out(out_)
    {}
};

template<typename Iterator>
size_t ParallelMergeRange<Iterator>::grainsize = 1000;

template<typename Iterator>
struct ParallelMergeBody {
    void operator()( ParallelMergeRange<Iterator>& r ) const {
        std::merge( r.begin1, r.end1, r.begin2, r.end2, r.out );
    }
};

template<typename Iterator>
void ParallelMerge( Iterator begin1, Iterator end1, Iterator begin2, Iterator end2, Iterator out ) {
    parallel_for(
       ParallelMergeRange<Iterator>(begin1,end1,begin2,end2,out),
       ParallelMergeBody<Iterator>(),
       simple_partitioner()
    );
}
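
The following usage sketch merges two sorted vectors; it is illustrative only, and uses raw pointers because all five iterator arguments of ParallelMerge must have the same type:

#include <vector>

void MergeVectors( std::vector<int>& a, std::vector<int>& b,
                   std::vector<int>& dst ) {
    // Destination must be large enough to hold both inputs.
    dst.resize( a.size() + b.size() );
    ParallelMerge( a.data(), a.data()+a.size(),
                   b.data(), b.data()+b.size(),
                   dst.data() );
}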

Because the algorithm moves many locations, it tends to be bandwidth limited. Speedup varies, depending upon the system.


Non-Preemptive Priorities


Problem

Choose the next work item to do, based on priorities.

Context

The scheduler in Intel® Threading Building Blocks (Intel® TBB) chooses tasks using rules based on scalability concerns. The rules are based on the order in which tasks were spawned or enqueued, and are oblivious to the contents of tasks. However, sometimes it is best to choose work based on some kind of priority relationship.

Forces

  • Given multiple work items, there is a rule for which item should be done next that is not the default Intel® TBB rule.

  • Preemptive priorities are not necessary. If a higher priority item appears, it is not necessary to immediately stop lower priority items in flight. If preemptive priorities are necessary, then non-preemptive tasking is inappropriate. Use threads instead.

Solution

Put the work in a shared work pile. Decouple tasks from specific work, so that each task execution chooses the actual piece of work from the pile.

Example

The following example implements three priority levels. The user interface for it and top-level implementation follow:

enum Priority {
   P_High,
   P_Medium,
   P_Low
};

template<typename Func>
void EnqueueWork( Priority p, Func f ) {
   WorkItem* item = new ConcreteWorkItem<Func>( p, f );
   ReadyPile.add(item);
}

The caller provides a priority p and a functor f to routine EnqueueWork. The functor may be the result of a lambda expression. EnqueueWork packages f as a WorkItem and adds it to global object ReadyPile.
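
For example, a caller might enqueue work items with lambda expressions; the bodies below are placeholders:

EnqueueWork( P_High, []{ /* refresh the display */ } );
EnqueueWork( P_Low,  []{ /* recompute statistics in the background */ } );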

Class WorkItem provides a uniform interface for running functors of unknown type:

// Abstract base class for a prioritized piece of work.
class WorkItem {
public:
   WorkItem( Priority p ) : priority(p) {}
   // Derived class defines the actual work.
   virtual void run() = 0;
   const Priority priority;
};

template<typename Func>
class ConcreteWorkItem: public WorkItem {
   Func f;
   /*override*/ void run() {
       f();
       delete this;
   }
public:
   ConcreteWorkItem( Priority p, const Func& f_ ) :
       WorkItem(p), f(f_)
   {}
};

Class ReadyPile contains the core pattern. It maintains a collection of work and fires off tasks that choose work from the collection:

class ReadyPileType {
   // One queue for each priority level
   tbb::concurrent_queue<WorkItem*> level[P_Low+1];
public:
   void add( WorkItem* item ) {
       level[item->priority].push(item);
       tbb::task::enqueue(*new(tbb::task::allocate_root()) RunWorkItem);
   }
   void runNextWorkItem() {
       // Scan queues in priority order for an item.
       WorkItem* item=NULL;
       for( int i=P_High; i<=P_Low; ++i )
           if( level[i].try_pop(item) )
               break;
       assert(item);
       item->run();
   }
};

ReadyPileType ReadyPile;

The task enqueued by add(item) does not necessarily execute that item. The task executes runNextWorkItem(), which may find a higher priority item. There is one task for each item, but the mapping resolves when the task actually executes, not when it is created.

Here are the details of class RunWorkItem:

class RunWorkItem: public tbb::task {
   /*override*/tbb::task* execute(); // Private override of virtual method
};
...
tbb::task* RunWorkItem::execute() {
   ReadyPile.runNextWorkItem();
   return NULL;
};

RunWorkItem objects are fungible. They enable the Intel® TBB scheduler to choose when to do a work item, not which work item to do. The override of virtual method task::execute is private because all calls to it are dispatched via base class task.

Other priority schemes can be implemented by changing the internals for ReadyPileType. A priority queue could be used to implement very fine grained priorities.
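
For instance, here is a sketch of such a variant built on tbb::concurrent_priority_queue; the comparator and class name are illustrative additions, not part of the pattern above:

#include "tbb/concurrent_priority_queue.h"

// Orders WorkItem pointers so that the numerically smallest Priority
// value (P_High) is popped first.
struct HigherPriorityFirst {
   bool operator()( const WorkItem* a, const WorkItem* b ) const {
       return a->priority > b->priority;
   }
};

class FineGrainedPileType {
   tbb::concurrent_priority_queue<WorkItem*,HigherPriorityFirst> pile;
public:
   void add( WorkItem* item ) {
       pile.push(item);
       // As before, enqueue a fungible task; RunWorkItem::execute would
       // call runNextWorkItem() on this pile instead of ReadyPile.
       tbb::task::enqueue(*new(tbb::task::allocate_root()) RunWorkItem);
   }
   void runNextWorkItem() {
       WorkItem* item = NULL;
       if( pile.try_pop(item) )
           item->run();
   }
};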

The scalability of the pattern is limited by the scalability of ReadyPileType. Ideally scalable concurrent containers should be used for it.


Lock Pathologies


Locks can introduce performance and correctness problems. If you are new to locking, here are some of the problems to avoid:

Deadlock

Deadlock happens when threads are trying to acquire more than one lock, and each holds some of the locks the other threads need to proceed. More precisely, deadlock happens when:

  • There is a cycle of threads

  • Each thread holds at least one lock on a mutex, and is waiting on a mutex for which the next thread in the cycle already has a lock.

  • No thread is willing to give up its lock.

Think of classic gridlock at an intersection – each car has "acquired" part of the road, but needs to "acquire" the road under another car to get through. Common ways to avoid deadlock are:

  • Avoid needing to hold two locks at the same time. Break your program into small actions in which each can be accomplished while holding a single lock.

  • Always acquire locks in the same order. For example, if you have "outer container" and "inner container" mutexes, and need to acquire a lock on one of each, you could always acquire the "outer container" one first. Another example is "acquire locks in alphabetical order" in a situation where the locks have names. Or if the locks are unnamed, acquire locks in order of the mutexes' numerical addresses, as in the sketch after this list.

  • Use atomic operations instead of locks.
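
A minimal sketch of the address-ordering approach with tbb::spin_mutex; the helper name is illustrative, and the two mutexes are assumed to be distinct:

#include "tbb/spin_mutex.h"

// Acquire two mutexes in a globally consistent (address) order so that two
// threads locking the same pair can never hold one lock each and wait forever.
void LockPair( tbb::spin_mutex& a, tbb::spin_mutex& b,
               tbb::spin_mutex::scoped_lock& lockA,
               tbb::spin_mutex::scoped_lock& lockB ) {
    if( &a < &b ) {
        lockA.acquire(a);
        lockB.acquire(b);
    } else {
        lockB.acquire(b);
        lockA.acquire(a);
    }
}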

Convoying

Another common problem with locks is convoying. Convoying occurs when the operating system interrupts a thread that is holding a lock. All other threads must wait until the interrupted thread resumes and releases the lock. Fair mutexes can make the situation even worse, because if a waiting thread is interrupted, all the threads behind it must wait for it to resume.

To minimize convoying, try to hold the lock as briefly as possible. Precompute whatever you can before acquiring the lock.

To avoid convoying, use atomic operations instead of locks where possible.
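
For example, a shared counter protected by a mutex can often be replaced by an atomic increment; a minimal sketch with tbb::atomic:

#include "tbb/atomic.h"

// No lock is held, so a thread preempted mid-update cannot stall others.
tbb::atomic<long> hitCount;

void RecordHit() {
    ++hitCount;    // atomic fetch-and-increment
}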


Exceptions and Cancellation


Intel® Threading Building Blocks (Intel® TBB) supports exceptions and cancellation. When code inside an Intel® TBB algorithm throws an exception, the following steps generally occur:

  1. The exception is captured. Any further exceptions inside the algorithm are ignored.

  2. The algorithm is cancelled. Pending iterations are not executed. If there is Intel® TBB parallelism nested inside, the nested parallelism may also be cancelled as explained in Cancellation and Nested Parallelism.

  3. Once all parts of the algorithm stop, an exception is thrown on the thread that invoked the algorithm.

The exception thrown in step 3 might be the original exception, or might merely be a summary of type captured_exception. The latter usually occurs on current systems because propagating exceptions between threads requires support for the C++ std::exception_ptr functionality. As compilers evolve to support this functionality, future versions of Intel® TBB might throw the original exception. So be sure your code can catch either type of exception. The following example demonstrates exception handling.

#include "tbb/tbb.h"
#include <vector>
#include <iostream>

using namespace tbb;
using namespace std;

vector<int> Data;

struct Update {
    void operator()( const blocked_range<int>& r ) const {
        for( int i=r.begin(); i!=r.end(); ++i )
            Data.at(i) += 1;
    }
};

int main() {
    Data.resize(1000);
    try {
        parallel_for( blocked_range<int>(0, 2000), Update());
    } catch( captured_exception& ex ) {
       cout << "captured_exception: "<< ex.what() << endl;
    } catch( out_of_range& ex ) {
       cout << "out_of_range: "<< ex.what() << endl;
    }
    return 0;
}

The parallel_for attempts to iterate over 2000 elements of a vector with only 1000 elements. Hence the expression Data.at(i) sometimes throws an exception std::out_of_range during execution of the algorithm. When the exception happens, the algorithm is cancelled and an exception is thrown at the call site of parallel_for.


Communication Between Graphs


All graph nodes require a reference to a graph object as one of the arguments to their constructor. It is only safe to construct edges between nodes that are part of the same graph. An edge expresses the topology of your graph to the runtime library. Connecting two nodes in different graphs can make it difficult to reason about whole graph operations, such as calls to graph::wait_for_all and exception handling. To optimize performance, the library may make calls to a node's predecessor or successor at times that are unexpected by the user.

If two graphs must communicate, do NOT create an edge between them, but instead use explicit calls to try_put. This will prevent the runtime library from making any assumptions about the relationship of the two nodes, and therefore make it easier to reason about events that cross the graph boundaries. However, it may still be difficult to reason about whole graph operations. For example, consider the graphs below:

    graph g;
    function_node< int, int > n1( g, 1, [](int i) -> int {
        cout << "n1\n";
        spin_for(i);
        return i;
    } );
    function_node< int, int > n2( g, 1, [](int i) -> int {
        cout << "n2\n";
        spin_for(i);
        return i;
    } );
    make_edge( n1, n2 );

    graph g2;
    function_node< int, int > m1( g2, 1, [](int i) -> int {
        cout << "m1\n";
        spin_for(i);
        return i;
    } );
    function_node< int, int > m2( g2, 1, [&](int i) -> int {
        cout << "m2\n";
        spin_for(i);
        n1.try_put(i);
        return i;
    } );
    make_edge( m1, m2 );

    m1.try_put( 1 );

    // The following call returns immediately:
    g.wait_for_all();
    // The following call returns after m1 & m2
    g2.wait_for_all();

    // we reach here before n1 & n2 are finished
    // even though wait_for_all was called on both graphs

In the example above, m1.try_put(1) sends a message to node m1, which runs its body and then sends a message to node m2. Next, node m2 runs its body and sends a message to n1 using an explicit try_put. In turn, n1 runs its body and sends a message to n2. The runtime library does not consider m2 to be a predecessor of n1 since no edge exists.

If you want to wait until all of the tasks spawned by these graphs are done, you need to call the function wait_for_all on both graphs. However, because there is cross-graph communication, the order of the calls is important. In the (incorrect) code segment above, the first call to g.wait_for_all() returns immediately because there are no tasks yet active in g; the only tasks that have been spawned by then belong to g2. The call to g2.wait_for_all returns after both m1 and m2 are done, since they belong to g2; the call does not however wait for n1 and n2, since they belong to g. The end of this code segment is therefore reached before n1 and n2 are done.

If the calls to wait_for_all are swapped, the code works as expected:

    g2.wait_for_all();
    g.wait_for_all();

    // all tasks are done

While it is not too difficult to reason about how these two very small graphs interact, the interaction of two larger graphs, perhaps with cycles, will be more difficult to understand. Therefore, communication between nodes in different graphs should be done with caution.


Graph Object


Conceptually a flow graph is a collection of nodes and edges. Each node belongs to exactly one graph and edges are made only between nodes in the same graph. In the flow graph interface, a graph object represents this collection of nodes and edges, and is used for invoking whole graph operations such as waiting for all tasks related to the graph to complete, resetting the state of all nodes in the graph, and canceling the execution of all nodes in the graph.

The code below creates a graph object and then waits for all tasks spawned by the graph to complete. The call to wait_for_all in this example returns immediately since this is a trivial graph with no nodes or edges, and therefore no tasks are spawned.

graph g;
g.wait_for_all();
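
A slightly larger sketch with a single node shows wait_for_all doing real work; the node body is illustrative:

#include "tbb/flow_graph.h"
#include <iostream>

int main() {
    tbb::flow::graph g;
    tbb::flow::continue_node<tbb::flow::continue_msg> hello( g,
        []( const tbb::flow::continue_msg& ) {
            std::cout << "hello from the graph\n";
        } );
    hello.try_put( tbb::flow::continue_msg() ); // spawns a task in g
    g.wait_for_all();                           // returns after that task completes
    return 0;
}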


Controlling Chunking


Chunking is controlled by a partitioner and a grainsize. To gain the most control over chunking, you specify both.

  • Specify simple_partitioner() as the third argument to parallel_for. Doing so turns off automatic chunking.

  • Specify the grainsize when constructing the range. The three-argument form of the constructor is blocked_range<T>(begin,end,grainsize). The default value of grainsize is 1. It is in units of loop iterations per chunk.

If the chunks are too small, the overhead may exceed the performance advantage.

The following code is the last example from parallel_for, modified to use an explicit grainsize G. The additions are the explicit grainsize G in the blocked_range constructor and the simple_partitioner() argument.

#include "tbb/tbb.h"

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for(blocked_range<size_t>(0,n,G), ApplyFoo(a),
                 simple_partitioner());
}

The grainsize sets a minimum threshold for parallelization. The parallel_for in the example invokes ApplyFoo::operator() on chunks, possibly of different sizes. Let chunksize be the number of iterations in a chunk. Using simple_partitioner guarantees that ⌈G/2⌉ ≤ chunksize ≤ G.

There is also an intermediate level of control where you specify the grainsize for the range, but use an auto_partitioner or affinity_partitioner. An auto_partitioner is the default partitioner. Both partitioners implement the automatic grainsize heuristic described in Automatic Chunking. An affinity_partitioner implies an additional hint, as explained later in the section Bandwidth and Cache Affinity. Though these partitioners may cause chunks to have more than G iterations, they never generate chunks with fewer than G/2 iterations. Specifying a range with an explicit grainsize may occasionally be useful to prevent these partitioners from generating wastefully small chunks if their heuristics fail.

Because of the impact of grainsize on parallel loops, it is worth reading the following material even if you rely on auto_partitioner and affinity_partitioner to choose the grainsize automatically.

Packaging Overhead Versus Grainsize

[Figure: useful work shown as a gray area inside an overhead border, for Case A (small grainsize) and Case B (large grainsize)]

The above figure illustrates the impact of grainsize by showing the useful work as the gray area inside a brown border that represents overhead. Both Case A and Case B have the same total gray area. Case A shows how too small a grainsize leads to a relatively high proportion of overhead. Case B shows how a large grainsize reduces this proportion, at the cost of reducing potential parallelism. The overhead as a fraction of useful work depends upon the grainsize, not on the number of grains. Consider this relationship and not the total number of iterations or number of processors when setting a grainsize.

A rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute. For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:

  1. Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations. If you have no idea of how many clock cycles an iteration might take, start with grainsize=100,000. The rationale is that each iteration normally requires at least one clock cycle. In most cases, step 3 will guide you to a much smaller value.

  2. Run your algorithm.

  3. Iteratively halve the grainsize parameter and see how much the algorithm slows down or speeds up as the value decreases.

A drawback of setting a grainsize too high is that it can reduce parallelism. For example, if the grainsize is 1000 and the loop has 2000 iterations, the parallel_for distributes the loop across only two processors, even if more are available. However, if you are unsure, err on the side of being a little too high instead of a little too low, because too low a value hurts serial performance, which in turn hurts parallel performance if there is other parallelism available higher up in the call tree.

Tip

You do not have to set the grainsize too precisely.

The next figure shows the typical "bathtub curve" for execution time versus grainsize, based on the floating point a[i]=b[i]*c computation over a million indices. There is little work per iteration. The times were collected on a four-socket machine with eight hardware threads.

Wall Clock Time Versus Grainsize

The scale is logarithmic. The downward slope on the left side indicates that with a grainsize of one, most of the overhead is parallel scheduling overhead, not useful work. An increase in grainsize brings a proportional decrease in parallel overhead. Then the curve flattens out because the parallel overhead becomes insignificant for a sufficiently large grainsize. At the end on the right, the curve turns up because the chunks are so large that there are fewer chunks than available hardware threads. Notice that a grainsize over the wide range 100-100,000 works quite well.

Tip

A general rule of thumb for parallelizing loop nests is to parallelize the outermost one possible. The reason is that each iteration of an outer loop is likely to provide a bigger grain of work than an iteration of an inner loop.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


Notational Conventions


The following conventions may be used in this document.

Convention

Explanation

Example

Italic

Used for introducing new terms, denotation of terms, placeholders, or titles of manuals.

The filename consists of the basename and the
extension.

For more information, refer to the Intel® Linker Manual.

Monospace

Indicates directory paths and filenames, commands and command line options, function names, methods, classes, data structures in body text, and source code.

ippsapi.h

\alt\include

Use the okCreateObjs() function to...

printf("hello, world\n");

Monospace italic

Indicates source code placeholders.

blocked_range<Type>

Monospace bold

Emphasizes parts of source code.

x = ( h > 0 ? sizeof(m) : 0xF ) + min;

[ ]

Items enclosed in brackets are optional.

Fa[c]

Indicates Fa or Fac.

{ | }

Braces and vertical bars indicate the choice of one item from a selection of two or more items.

X{K | W | P}

Indicates XK, XW, or XP.

"[""]""{"
" }""|"

Writing a metacharacter in quotation marks negates the syntactical meaning stated above;
the character is taken as a literal.

"[" X "]" [ Y ]

Denotes the letter X enclosed in brackets, optionally followed by the letter Y.

...

The ellipsis indicates that the previous item can be repeated several times.

filename ...

Indicates that one or more filenames can be specified.

,...

The ellipsis preceded by a comma indicates that the previous item can be repeated several times,
separated by commas.

word ,...

Indicates that one or more words can be specified. If more than one word is specified, the words are comma-separated.

Class members are summarized by informal class declarations that describe the class as it seems to clients, not how it is actually implemented. For example, here is an informal declaration of class Foo:

class Foo {
public:
        int x();
        int y;
        ~Foo();
};

The actual implementation might look like:

namespace internal {
        class FooBase  {
        protected:
                int x();
        };

        class Foo_v3: protected FooBase {
        private:
                int internal_stuff;
        public:
                using FooBase::x;
                int y;
        };
}

typedef internal::Foo_v3 Foo;

The example shows three cases where the actual implementation departs from the informal declaration:

  • Foo is actually a typedef to Foo_v3.

  • Method x() is inherited from a protected base class.

  • The destructor is an implicit method generated by the compiler.

The informal declarations are intended to show you what you need to know to use the class without the distraction of irrelevant clutter particular to the implementation.


Threading Intel® Integrated Performance Primitives Image Resize with Intel® Threading Building Blocks



Introduction

The Intel® Integrated Performance Primitives (Intel® IPP) library provides a wide variety of vectorized signal and image processing functions. Intel® Threading Building Blocks (Intel® TBB) adds simple but powerful abstractions for expressing parallelism in C++ programs. This article presents a starting point for using these tools together to combine the benefits of vectorization and threading to resize images.   

From Intel® IPP 8.2 onwards, the multi-threaded (internally threaded) libraries are deprecated due to issues with performance and interoperability with other threading models, but they remain available for legacy applications. However, multithreaded programming is now mainstream and there is a rich ecosystem of threading tools such as Intel® TBB. In most cases, handling threading at the application level (that is, external to/above the primitives) offers many advantages. Many applications already have their own threading model, and application-level/external threading gives developers the greatest flexibility and control. With a little extra effort to add threading to an application it is possible to meet or exceed internal threading performance, and this opens the door to more advanced optimization techniques such as reusing local cache data for multiple operations. This is the main reason internal threading is being deprecated in recent releases.

Getting started with parallel_for

Intel® TBB’s parallel_for offers an easy way to get started with parallelism, and it is one of the most commonly used parts of Intel® TBB. Any for() loop in an application where each iteration can be done independently, and where the order of execution doesn’t matter, is a candidate. In these scenarios, Intel® TBB parallel_for is useful and takes care of most details, like setting up a thread pool and a scheduler. You supply the partitioning scheme and the code to run on separate threads or cores. More sophisticated approaches are possible. However, the goal of this article and sample code is to provide a simple starting point and not the best possible threading configuration for every situation.

Intel® TBB’s parallel_for takes 2 or 3 arguments. 

parallel_for ( range, body, optional partitioner ) 

The range, for this simplified line-based partitioning, is specified by:

blocked_range<int>(begin, end, grainsize)

This provides information to each thread about which lines of the image it is processing. It will automatically partition a range from begin to end in grainsize chunks.  For Intel® TBB the grainsize is automatically adjusted when ranges don't partition evenly, so it is easy to accommodate arbitrary sizes.

The body is the section of code to be parallelized. This can be implemented separately (including as part of a class); though for simple cases it is often convenient to use a lambda expression. With the lambda approach the entire function body is part of the parallel_for call. Variables to pass to this anonymous function are listed in brackets [alg, pSrc, pDst, stridesrc_8u, …] and range information is passed via blocked_range<int>& range.

This is a general threading abstraction which can be applied to a wide variety of problems.  There are many examples elsewhere showing parallel_for with simple loops such as array operations.  Tailoring for resize follows the same pattern.

External Parallelization for Intel® IPP Resize

A threaded resize can be split into tiles of any shape. However, it is convenient to use groups of rows where the tiles are the width of the image.

Each thread can query range.begin(), range.size(), etc. to determine offsets into the image buffer. Note: this starting point implementation assumes that the entire image is available within a single buffer in memory. 

The image resize functions introduced in Intel® IPP 7.1 and later versions take a new approach that has many advantages:

  • IppiResizeSpec holds precalculated coefficients based on the input/output resolution combination. Multiple resizes can be completed without recomputing them.
  • Separate functions for each interpolation method.
  • Significantly smaller executable size footprint with static linking.
  • Improved support for threading and tiled image processing.
  • For more information please refer to the article: Resize Changes in Intel® IPP 7.1

Before starting resize, the offsets (number of bytes to add to the source and destination pointers to calculate where each thread’s region starts) must be calculated. Intel® IPP provides a convenient function for this purpose:

ippiResizeGetSrcOffset

This function calculates the corresponding offset/location in the source image for a location in the destination image. In this case, the destination offset is the beginning of the thread’s blocked range.

After this function it is easy to calculate the source and destination addresses for each thread’s current work unit:

pSrcT=pSrc+(srcOffset.y*stridesrc_8u);
pDstT=pDst+(dstOffset.y*stridedst_8u);

These are plugged into the resize function, like this:

ippiResizeLanczos_8u_C1R(pSrcT, stridesrc_8u, pDstT, stridedst_8u, dstOffset, dstSizeT, ippBorderRepl, 0, pSpec, localBuffer);

This specifies how each thread works on a subset of lines of the image. Instead of using the beginning of the source and destination buffers, pSrcT and pDstT provide the starting points of the regions each thread is working with. The height of each thread's region is passed to resize via dstSizeT. Of course, in the special case of 1 thread these values are the same as for a nonthreaded implementation.

Another difference to call out is that since each thread is doing its own resize simultaneously the same working buffer cannot be used for all threads. For simplicity the working buffer is allocated within the lambda function with scalable_aligned_malloc, though further efficiency could be gained by pre-allocating a buffer for each thread.

The following code snippet demonstrates how to set up resize within a parallel_for lambda function, and how the concepts described above could be implemented together.  


parallel_for( blocked_range<int>( 0, pnminfo_dst.imgsize.height, grainsize ),
            [pSrc, pDst, stridesrc_8u, stridedst_8u, pnminfo_src,
            pnminfo_dst, bufSize, pSpec]( const blocked_range<int>& range )
        {
            Ipp8u *pSrcT,*pDstT;
            IppiPoint srcOffset = {0, 0};
            IppiPoint dstOffset = {0, 0};

            // resized region is the full width of the image,
            // The height is set by TBB via range.size()
            IppiSize  dstSizeT = {pnminfo_dst.imgsize.width,(int)range.size()};

            // set up working buffer for this thread's resize
            Ipp32s localBufSize=0;
            ippiResizeGetBufferSize_8u( pSpec, dstSizeT,
                pnminfo_dst.nChannels, &localBufSize );

            Ipp8u *localBuffer =
                (Ipp8u*)scalable_aligned_malloc( localBufSize*sizeof(Ipp8u), 32);

            // given the destination offset, calculate the offset in the source image
            dstOffset.y=range.begin();
            ippiResizeGetSrcOffset_8u(pSpec,dstOffset,&srcOffset);

            // pointers to the starting points within the buffers that this thread
            // will read from/write to
            pSrcT=pSrc+(srcOffset.y*stridesrc_8u);
            pDstT=pDst+(dstOffset.y*stridedst_8u);


            // do the resize for greyscale or color
            switch (pnminfo_dst.nChannels)
            {
            case 1: ippiResizeLanczos_8u_C1R(pSrcT,stridesrc_8u,pDstT,stridedst_8u,
                        dstOffset,dstSizeT,ippBorderRepl, 0, pSpec,localBuffer); break;
            case 3: ippiResizeLanczos_8u_C3R(pSrcT,stridesrc_8u,pDstT,stridedst_8u,
                        dstOffset,dstSizeT,ippBorderRepl, 0, pSpec,localBuffer); break;
            default:break; //only 1 and 3 channel images
            }

            scalable_aligned_free((void*) localBuffer);
        });
 

As you can see, a threaded implementation can be quite similar to single threaded.  The main difference is simply that the image is partitioned by Intel® TBB to work across several threads, and each thread is responsible for groups of image lines. This is a relatively straightforward way to divide the task of resizing an image across multiple cores or threads.

Conclusion

Intel® IPP provides a suite of SIMD-optimized functions. Intel® TBB provides a simple but powerful way to handle threading in Intel® IPP applications. Using them together allows access to great vectorized performance on each core as well as efficient partitioning to multiple cores. The deeper level of control available with external threading enables more efficient processing and better performance. 

Example code: As with other  Intel® IPP sample code, by downloading you accept the End User License Agreement.


    concurrent_priority_queue Template Class


    Summary

    Template class for priority queue with concurrent operations.

    Syntax

    template<typename T, typename Compare = std::less<T>,
             typename Alloc = cache_aligned_allocator<T> >
    class concurrent_priority_queue;

    Header

    #include "tbb/concurrent_priority_queue.h"

    Description

    A concurrent_priority_queue is a container that permits multiple threads to concurrently push and pop items. Items are popped in priority order as determined by a template parameter. The capacity of the queue is unbounded, subject to memory limitations on the target machine.

    The interface is similar to STL std::priority_queue except where it must differ to make concurrent modification safe.

    Differences Between STL queue and Intel® Threading Building Blocks concurrent_priority_queue

    Feature

    STL std::priority_queue

    concurrent_priority_queue

    Choice of underlying container

    Sequence template parameter

    No choice of underlying container; allocator choice is provided instead

    Access to highest priority item

    const value_type& top() const

    Not available. Unsafe for concurrent container

    Copy and pop item if present

    bool b=!q.empty(); if(b) { x=q.top(); q.pop(); }

    bool b = q.try_pop(x);

    Get number of items in queue

    size_type size() const

    Same, but may be inaccurate due to pending concurrent push or pop operations

    Check if there are items in queue

    bool empty() const

    Same, but may be inaccurate due to pending concurrent push or pop operations

    Members

    namespace tbb {
        template <typename T, typename Compare=std::less<T>,
                     typename A=cache_aligned_allocator<T> >
        class concurrent_priority_queue {
        public:
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef size_t size_type;
            typedef ptrdiff_t difference_type;
            typedef A allocator_type;
    
            //Constructors
            concurrent_priority_queue(const allocator_type& a = allocator_type());
            concurrent_priority_queue(size_type init_capacity,
                                      const allocator_type& a = allocator_type());
            template<typename InputIterator>
            concurrent_priority_queue(InputIterator begin, InputIterator end,
                                      const allocator_type& a = allocator_type());
            concurrent_priority_queue(const concurrent_priority_queue& src,
                                      const allocator_type& a = allocator_type());
            //C++11 specific
            concurrent_priority_queue(concurrent_priority_queue&& src);
            concurrent_priority_queue(concurrent_priority_queue&& src,
                                      const allocator_type& a);
            concurrent_priority_queue(std::initializer_list<T> il,
                                      const allocator_type &a = allocator_type());
    
            //Assignment
            concurrent_priority_queue& operator=(const concurrent_priority_queue& src);
            template<typename InputIterator>
            void assign(InputIterator begin, InputIterator end);
            //C++11 specific
            concurrent_priority_queue& operator=(concurrent_priority_queue&& src);
            concurrent_priority_queue& operator=(std::initializer_list<T> il);
            void assign(std::initializer_list<T> il);
    
            void swap(concurrent_priority_queue& other);
    
            ~concurrent_priority_queue();
    
            allocator_type get_allocator() const;
    
            bool empty() const;
            size_type size() const;
    
            void push(const_reference elem);
            //C++11 specific
            void push(T&& elem);
            template<typename... Args>
            void emplace(Args&&... args);
    
            bool try_pop(reference elem);
    
            void clear();
        };
    }
    
    The following table provides additional information on the members of this template class.
    Member

    Description
    concurrent_priority_queue(const allocator_type& a = allocator_type())

    Constructs empty queue.

    concurrent_priority_queue(size_type init_capacity, const allocator_type& a = allocator_type())

    Constructs an empty queue with an initial capacity.

    template <typename InputIterator> concurrent_priority_queue(InputIterator begin, InputIterator end, const allocator_type& a = allocator_type())

    Constructs a queue containing copies of elements in the iterator half-open interval [begin, end).

    concurrent_priority_queue(std::initializer_list<T> il, const allocator_type &a = allocator_type())

    C++11 specific; Equivalent to concurrent_priority_queue(il.begin(), il.end(), a) .

    concurrent_priority_queue(const concurrent_priority_queue& src, const allocator_type& a = allocator_type())

    Constructs a copy of src. This operation is not thread-safe and may result in an error or an invalid copy of src if another thread is concurrently modifying src.

    concurrent_priority_queue(concurrent_priority_queue&& src)

    C++11 specific; Constructs a new queue by moving content from src. src is left in an unspecified state, but can be safely destroyed. This operation is unsafe if there are pending concurrent operations on the src queue.

    concurrent_priority_queue(concurrent_priority_queue&& src, const allocator_type& a)

    C++11 specific; Constructs a new queue by moving content from src using the allocator a. src is left in an unspecified state, but can be safely destroyed. This operation is unsafe if there are pending concurrent operations on the src queue.

    concurrent_priority_queue& operator=(const concurrent_priority_queue& src)

    Assigns contents of src to *this. This operation is not thread-safe and may result in an error or an invalid copy of src if another thread is concurrently modifying src.

    Returns: a reference to *this.

    concurrent_priority_queue& operator=(concurrent_priority_queue&& src);

    C++11 specific; Moves data from src to *this. src is left in an unspecified state, but can be safely destroyed. This operation is unsafe if there are pending concurrent operations on the src queue.

    Returns: a reference to *this.

    concurrent_priority_queue& operator=(std::initializer_list<T> il)

    C++11 specific; Assigns contents of the initializer list il to *this.

    Returns: a reference to *this.

    template <typename InputIterator> void assign(InputIterator begin, InputIterator end)

    Assigns contents of the iterator half-open interval [begin, end) to *this.

    void assign(std::initializer_list<T> il)

    C++11 specific; Equivalent to assign(il.begin(), il.end()) .

    ~concurrent_priority_queue()

    Destroys all items in the queue, and the container itself, so that it can no longer be used.

    bool empty() const

    Returns: true if queue has no items; false otherwise. May be inaccurate when concurrent push or try_pop operations are pending. This operation reads shared data and may trigger a race condition in race detection tools when used concurrently.

    size_type size() const

    Returns: Number of items in the queue. May be inaccurate when concurrent push or try_pop operations are pending. This operation reads shared data and may trigger a race condition in race detection tools when used concurrently.

    void push(const_reference elem)

    Pushes a copy of elem into the queue. This operation is thread-safe with other push, try_pop and emplace operations.

    void push(T&& elem)

    C++11 specific; Pushes a given element into the queue using move constructor. This operation is thread-safe with other push, try_pop and emplace operations.

    template<typename... Args> void emplace(Args&&... args);

    C++11 specific; Pushes a new element into the queue. The element is constructed with given arguments. This operation is thread-safe with other push, try_pop and emplace operations.

    bool try_pop(reference elem)

    If the queue is not empty, copies the highest priority item from the queue and assigns it to elem, and destroys the popped item in the queue; otherwise, does nothing. This operation is thread-safe with other push, try_pop and emplace operations.

    Returns: true if an item was popped; false otherwise.

    void clear()

    Clears the queue; results in size()==0. This operation is not thread-safe.

    void swap(concurrent_priority_queue& other)

    Swaps the queue contents with those of other. This operation is not thread-safe.

    allocator_type get_allocator() const

    Returns: A copy of allocator used to construct the queue.
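
    As a usage sketch, the following pushes items concurrently and then pops the highest priority one; the element type and counts are illustrative:

    #include "tbb/concurrent_priority_queue.h"
    #include "tbb/parallel_for.h"
    #include <iostream>

    int main() {
        tbb::concurrent_priority_queue<int> q;
        // Concurrent pushes are safe with respect to each other and to try_pop.
        tbb::parallel_for( 0, 100, [&q]( int i ) { q.push(i); } );
        int top;
        if( q.try_pop(top) )
            std::cout << "highest priority item: " << top << "\n"; // 99 with std::less<int>
        return 0;
    }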


    Container Range Concepts


    Summary

    View set of items in a container as a recursively divisible range.

    Requirements

    A Container Range is a Range with the further requirements listed below.

    Additional Requirements on a Container Range R

    Pseudo-Signature

    Semantics

    R::value_type

    Item type

    R::reference

    Item reference type

    R::const_reference

    Item const reference type

    R::difference_type

    Type for difference of two iterators

    R::iterator

    Iterator type for range

    R::iterator R::begin()

    First item in range

    R::iterator R::end()

    One past last item in range

    R::size_type R::grainsize() const

    Grain size

    Model Types

    Classes concurrent_hash_map and concurrent_vector both have member types range_type and const_range_type that model a Container Range.

    Use the range types in conjunction with parallel_for, parallel_reduce, and parallel_scan to iterate over items in a container.
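
    For example, here is a sketch that increments every element of a concurrent_vector through its container range; the function name is illustrative:

    #include "tbb/concurrent_vector.h"
    #include "tbb/parallel_for.h"

    void IncrementAll( tbb::concurrent_vector<int>& v ) {
        tbb::parallel_for( v.range(),
            []( const tbb::concurrent_vector<int>::range_type& r ) {
                for( auto i = r.begin(); i != r.end(); ++i )
                    *i += 1;
            } );
    }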


    parallel_reduce Template Function


    Summary

    Computes reduction over a range.

    Header

     #include "tbb/parallel_reduce.h"

    Syntax

    template<typename Range, typename Value,
             typename Func, typename Reduction>
    Value parallel_reduce( const Range& range, const Value& identity,
                         const Func& func, const Reduction& reduction,
                         [, partitioner[, task_group_context& group]] );
    
    template<typename Range, typename Body>
    void parallel_reduce( const Range& range, const Body& body
                          [, partitioner[, task_group_context& group]] );

    where the optional partitioner declares any of the partitioners as shown in column 1 of the Partitioners table in the Partitioners section.

    Description

    The parallel_reduce template has two forms. The functional form is designed to be easy to use in conjunction with lambda expressions. The imperative form is designed to minimize copying of data.

    The functional form parallel_reduce(range,identity,func,reduction) performs a parallel reduction by applying func to subranges in range and reducing the results using binary operator reduction. It returns the result of the reduction. Parameters func and reduction can be lambda expressions. The table below summarizes the type requirements on the types of identity, func, and reduction.

    Requirements for Func and Reduction

    Pseudo-Signature

    Semantics

    Value Identity;

    Left identity element for Func::operator().

    Value Func::operator()(const Range& range, const Value& x)

    Accumulate result for subrange, starting with initial value x.

    Value Reduction::operator()(const Value& x, const Value& y);

    Combine results x and y.

    The imperative form parallel_reduce(range,body) performs parallel reduction of body over each value in range. Type Range must model the Range concept. The body must model the requirements shown in the table below.

    Requirements for parallel_reduce Body

    Pseudo-Signature

    Semantics

    Body::Body( Body&, split );

    Splitting constructor. Must be able to run concurrently with operator() and method join.

    Body::~Body()

    Destructor.

    void Body::operator()(const Range& range);

    Accumulate result for subrange.

    void Body::join( Body& rhs );

    Join results. The result in rhs should be merged into the result of this.

    A parallel_reduce recursively splits the range into subranges to the point such that is_divisible() is false for each subrange. A parallel_reduce uses the splitting constructor to make one or more copies of the body for each thread. It may copy a body while the body’s operator() or method join runs concurrently. You are responsible for ensuring the safety of such concurrency. In typical usage, the safety requires no extra effort.

    When worker threads are available, parallel_reduce invokes the splitting constructor for the body. For each such split of the body, it invokes method join in order to merge the results from the bodies. Define join to update this to represent the accumulated result for this and rhs. The reduction operation should be associative, but does not have to be commutative. For a noncommutative operation op, "left.join(right)" should update left to be the result of left op right.

    A body is split only if the range is split, but the converse is not necessarily so. The figure below diagrams a sample execution of parallel_reduce. The root represents the original body b0 being applied to the half-open interval [0,20). The range is recursively split at each level into two subranges. The grain size for the example is 5, which yields four leaf ranges. The slash marks (/) denote where copies (b1 and b2) of the body were created by the body splitting constructor. Bodies b0 and b1 each evaluate one leaf. Body b2 evaluates leaf [10,15) and [15,20), in that order. On the way back up the tree, parallel_reduce invokes b0.join(b1) and b0.join(b2) to merge the results of the leaves.

    Execution of parallel_reduce over blocked_range<int>(0,20,5)

    The figure above shows only one possible execution. Other valid executions include splitting b2 into b2 and b3, or doing no splitting at all. With no splitting, b0 evaluates each leaf in left to right order, with no calls to join. A given body always evaluates one or more subranges in left to right order. For example, in the figure above, body b2 is guaranteed to evaluate [10,15) before [15,20). You may rely on the left to right property for a given instance of a body. However, you must neither rely on a particular choice of body splitting nor on the subranges processed by a given body object being consecutive. parallel_reduce makes the choice of body splitting nondeterministically.

    Example where Body b0 processes non-consecutive subranges.

    The subranges evaluated by a given body are not consecutive if there is an intervening join. The joined information represents processing of a gap between evaluated subranges. The figure above shows such an example. The body b0 performs the following sequence of operations:

    1. b0( [0,5) )
    2. b0.join( b1 ) where b1 has already processed [5,10)
    3. b0( [10,15) )
    4. b0( [15,20) )

    In other words, body b0 gathers information about all the leaf subranges in left to right order, either by directly processing each leaf, or by a join operation on a body that gathered information about one or more leaves in a similar way. When no worker threads are available, parallel_reduce executes sequentially from left to right in the same sense as for parallel_for . Sequential execution never invokes the splitting constructor or method join.

    All overloads can be passed a task_group_context object so that the algorithm’s tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

    Complexity

    If the range and body take O(1) space, and the range splits into nearly equal pieces, then the space complexity is O(P log(N)), where N is the size of the range and P is the number of threads.

    Example (Imperative Form)

    The following code sums the values in an array.

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    
    using namespace tbb;
    
    struct Sum {
        float value;
        Sum() : value(0) {}
        Sum( Sum& s, split ) {value = 0;}
        void operator()( const blocked_range<float*>& r ) {
            float temp = value;
            for( float* a=r.begin(); a!=r.end(); ++a ) {
                temp += *a;
            }
            value = temp;
        }
        void join( Sum& rhs ) {value += rhs.value;}
    };
    
    float ParallelSum( float array[], size_t n ) {
        Sum total;
        parallel_reduce( blocked_range<float*>( array, array+n ),
                         total );
        return total.value;
    }

    The example generalizes to reduction for any associative operation op as follows:

    • Replace occurrences of 0 with the identity element for op
    • Replace occurrences of += with op= or its logical equivalent.
    • Change the name Sum to something more appropriate for op.

    The operation may be noncommutative. For example, op could be matrix multiplication.

    Example with Lambda Expressions

    The following is analogous to the previous example, but written using lambda expressions and the functional form of parallel_reduce.

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    
    using namespace tbb;
    
    float ParallelSum( float array[], size_t n ) {
        return parallel_reduce(
            blocked_range<float*>( array, array+n ),
            0.f,
            [](const blocked_range<float*>& r, float init)->float {
                for( float* a=r.begin(); a!=r.end(); ++a )
                    init += *a;
                return init;
            },
            []( float x, float y )->float {
                return x+y;
            }
        );
    }

    STL generalized numeric operations and function objects can be used to write the example more compactly as follows:

    #include <numeric>
    #include <functional>
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    
    using namespace tbb;
    
    float ParallelSum( float array[], size_t n ) {
        return parallel_reduce(
            blocked_range<float*>( array, array+n ),
            0.f,
            [](const blocked_range<float*>& r, float value)->float {
                return std::accumulate(r.begin(),r.end(),value);
            },
            std::plus<float>()
        );
    }


    Version Information


    Intel® Threading Building Blocks (Intel® TBB) has macros, an environment variable, and a function that reveal version and run-time information.

    Version Macros

    The header tbb/tbb_stddef.h defines macros related to versioning, as described below. You should not redefine these macros.

    Version Macros

    Macro

    Description of Value

    TBB_INTERFACE_VERSION

    Current interface version. The value is a decimal numeral of the form xyyy, where x is the major version number and yyy is the minor version number.

    TBB_INTERFACE_VERSION_MAJOR

    TBB_INTERFACE_VERSION/1000; that is, the major version number.

    TBB_COMPATIBLE_INTERFACE_VERSION

    Oldest major interface version still supported.

    TBB_VERSION Environment Variable

    Set the environment variable TBB_VERSION to 1 to cause the library to print information on stderr. Each line is of the form “TBB: tag value”, where tag and value are described below.

    Output from TBB_VERSION

    Tag

    Description of Value

    VERSION

    Intel TBB product version number.

    INTERFACE_VERSION

    Value of macro TBB_INTERFACE_VERSION when library was compiled.

    BUILD_...

    Various information about the machine configuration on which the library was built.

    TBB_USE_ASSERT

    Setting of macro TBB_USE_ASSERT

    DO_ITT_NOTIFY

    1 if library can enable instrumentation for Intel® Parallel Studio XE and Intel® Threading Tools; 0 or undefined otherwise.

    ITT

    yes if library has enabled instrumentation for Intel® Parallel Studio XE and Intel® Threading Tools, no otherwise. Typically yes only if the program is running under control of Intel® Parallel Studio XE or Intel® Threading Tools.

    ALLOCATOR

    Underlying allocator for tbb::tbb_allocator. It is scalable_malloc if the Intel® TBB malloc library was successfully loaded; malloc otherwise.

    Caution

    This output is implementation specific and may change at any time.

    TBB_runtime_interface_version Function

    Summary

    Function that returns the interface version of the Intel® TBB library that was loaded at runtime.

    Syntax

    extern "C" int TBB_runtime_interface_version();
    

    Header

    #include "tbb/tbb_stddef.h"

    Description

    The value returned by TBB_runtime_interface_version() may differ from the value of TBB_INTERFACE_VERSION obtained at compile time. This can be used to identify whether an application was compiled against a compatible version of the Intel® TBB headers.

    In general, the run-time value TBB_runtime_interface_version() must be greater than or equal to the compile-time value of TBB_INTERFACE_VERSION. Otherwise the application may fail to resolve all symbols at run time.
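
    For example, an application might perform a startup check along the following lines. This is a sketch; the message text and exit policy are illustrative.

    #include <cstdio>
    #include <cstdlib>
    #include "tbb/tbb_stddef.h"
    
    int main() {
        // Compare the interface version of the library loaded at run time
        // against the interface version of the headers used at compile time.
        if( TBB_runtime_interface_version()<TBB_INTERFACE_VERSION ) {
            std::fprintf( stderr,
                "Loaded Intel TBB library (interface %d) is older than the headers (interface %d)\n",
                TBB_runtime_interface_version(), TBB_INTERFACE_VERSION );
            return EXIT_FAILURE;
        }
        // ... rest of the application ...
        return 0;
    }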


    Local Serializer


    Context

    Consider an interactive program. To maximize concurrency and responsiveness, operations requested by the user can be implemented as tasks. The order of operations can be important. For example, suppose the program presents editable text to the user. There might be operations to select text and delete selected text. Reversing the order of "select" and "delete" operations on the same buffer would be bad. However, commuting operations on different buffers might be okay. Hence the goal is to establish serial ordering of tasks associated with a given object, but not constrain ordering of tasks between different objects.

    Forces

    • Operations associated with a certain object must be performed in serial order.

    • Serializing with a lock would be wasteful because threads would be waiting at the lock when they could be doing useful work elsewhere.

    Solution

    Sequence the work items using a FIFO (first-in first-out) structure. Always keep an item in flight if possible. If no item is in flight when a work item appears, put the item in flight. Otherwise, push the item onto the FIFO. When the current item in flight completes, pop another item from the FIFO and put it in flight.

    The logic can be implemented without mutexes, by using concurrent_queue for the FIFO and atomic<int> to count the number of items waiting and in flight. The example explains the accounting in detail.

    Example

    The following example builds on the Non-Preemptive Priorities example to implement local serialization in addition to priorities. It implements three priority levels and local serializers. The user interface for it follows:

    enum Priority {
       P_High,
       P_Medium,
       P_Low
    };
    
    template<typename Func>
    void EnqueueWork( Priority p, Func f, Serializer* s=NULL );

    Template function EnqueueWork causes functor f to run when the three constraints in the following table are met.

    Implementation of Constraints

    Constraint

    Resolved by class...

    Any prior work for the Serializer has completed.

    Serializer

    A thread is available.

    RunWorkItem

    No higher priority work is ready to run.

    ReadyPileType

    Constraints on a given functor are resolved from top to bottom in the table. The first constraint does not exist when s is NULL. The implementation of EnqueueWork packages the functor in a SerializedWorkItem and routes it to the class that enforces the first relevant constraint between pieces of work.

    template<typename Func>
    void EnqueueWork( Priority p, Func f, Serializer* s=NULL ) {
       WorkItem* item = new SerializedWorkItem<Func>( p, f, s );
       if( s )
           s->add(item);
       else
           ReadyPile.add(item);
    }

    A SerializedWorkItem is derived from a WorkItem, which serves as a way to pass around a prioritized piece of work without knowing further details of the work.

    // Abstract base class for a prioritized piece of work.
    class WorkItem {
    public:
       WorkItem( Priority p ) : priority(p) {}
       // Derived class defines the actual work.
       virtual void run() = 0;
       const Priority priority;
    };
    
    template<typename Func>
    class SerializedWorkItem: public WorkItem {
       Serializer* serializer;
       Func f;
       /*override*/ void run() {
           f();
           Serializer* s = serializer;
           // Destroy f before running Serializer’s next functor.
           delete this;
           if( s )
               s->noteCompletion();
       }
    public:
       SerializedWorkItem( Priority p, const Func& f_, Serializer* s ) :
           WorkItem(p), serializer(s), f(f_)
       {}
    };

    Base class WorkItem is the same as class WorkItem in the example for Non-Preemptive Priorities. The notion of serial constraints is completely hidden from the base class, thus permitting the framework to be extended to other kinds of constraints, or to no constraints at all. Class SerializedWorkItem is essentially ConcreteWorkItem from the example for Non-Preemptive Priorities, extended with a Serializer aspect.

    Virtual method run() is invoked when it becomes time to run the functor. It performs three steps:

    1. Run the functor.

    2. Destroy the functor.

    3. Notify the Serializer that the functor completed, thus unconstraining the next waiting functor.

    Step 3 is the difference from the operation of ConcreteWorkItem::run. Step 2 could be done after step 3 in some contexts to increase concurrency slightly. However, the presented order is recommended because if step 2 takes non-trivial time, it likely has side effects that should complete before the next functor runs.

    Class Serializer implements the core of the Local Serializer pattern:

    class Serializer {
       tbb::concurrent_queue<WorkItem*> queue;
       tbb::atomic<int> count;         // Count of queued items and in-flight item
       void moveOneItemToReadyPile() { // Transfer item from queue to ReadyPile
           WorkItem* item;
           queue.try_pop(item);
           ReadyPile.add(item);
       }
    public:
       void add( WorkItem* item ) {
           queue.push(item);
           if( ++count==1 )
               moveOneItemToReadyPile();
       }
       void noteCompletion() {        // Called when WorkItem completes.
           if( --count!=0 )
               moveOneItemToReadyPile();
       }
    };
    

    The class maintains two members:

    • A queue of WorkItem waiting for prior work to complete.

    • A count of queued or in-flight work.

    Mutexes are avoided by using concurrent_queue<WorkItem*> and atomic<int> along with careful ordering of operations. The transitions of count are the key to understanding how class Serializer works.

    • If method add increments count from 0 to 1, this indicates that no other work is in flight and thus the work should be moved to the ReadyPile.

    • If method noteCompletion decrements count and it is not from 1 to 0, then the queue is non-empty and another item in the queue should be moved to ReadyPile.

    Class ReadyPile is explained in the example for Non-Preemptive Priorities.

    If priorities are not necessary, there are two variations on method moveOneItemToReadyPile, with different implications.

    • Method moveOneItemToReadyPile could directly invoke item->run(). This approach has relatively low overhead and high thread locality for a given Serializer. But it is unfair. If the Serializer has a continual stream of tasks, the thread operating on it will keep servicing those tasks to the exclusion of others.

    • Method moveOneItemToReadyPile could invoke task::enqueue to enqueue a task that invokes item->run(). Doing so introduces higher overhead and less locality than the first approach, but avoids starvation.

    The conflict between fairness and maximum locality is fundamental. The best resolution depends upon circumstance.
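
    For illustration only, the second variation might look like the following sketch. The names RunItemTask and moveOneItemToScheduler are hypothetical, and the old tbb::task API used by the surrounding examples is assumed.

    #include "tbb/task.h"
    #include "tbb/concurrent_queue.h"
    
    // Hypothetical wrapper task, analogous to RunWorkItem in the
    // Non-Preemptive Priorities example.
    class RunItemTask: public tbb::task {
        WorkItem* item;
        /*override*/ tbb::task* execute() {
            item->run();   // run() deletes the WorkItem and notifies its Serializer
            return NULL;
        }
    public:
        RunItemTask( WorkItem* i ) : item(i) {}
    };
    
    // Fair variation: hand the item to the scheduler instead of adding it
    // to ReadyPile or running it in place.
    void moveOneItemToScheduler( tbb::concurrent_queue<WorkItem*>& queue ) {
        WorkItem* item;
        queue.try_pop(item);
        tbb::task::enqueue( *new( tbb::task::allocate_root() ) RunItemTask(item) );
    }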

    The pattern generalizes to constraints on work items more general than those maintained by class Serializer. A generalized Serializer::add determines if a work item is unconstrained, and if so, runs it immediately. A generalized Serializer::noteCompletion runs all previously constrained items that have become unconstrained by the completion of the current work item. Here "run" means to run the work immediately or, if there are further constraints, to forward the work to the next constraint resolver.


    Recursive Chain Reaction


    The scheduler works best with tree-structured task graphs, because that is where the strategy of "breadth-first theft and depth-first work" applies very well. Also, tree-structured task graphs allow fast creation of many tasks. For example, if a master task tries to create N children directly, it will take O(N) steps. But with tree structured forking, it takes only O(lg(N)) steps.
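
    As a rough sketch of tree-structured forking, the following uses tbb::parallel_invoke; ProcessItem is a hypothetical leaf operation.

    #include "tbb/parallel_invoke.h"
    
    void ProcessItem( int i );   // hypothetical leaf operation
    
    // Recursively fork over [first,last) as a binary tree: creating N leaf
    // tasks takes O(lg(N)) forking steps along any path, not O(N).
    void ProcessRange( int first, int last ) {
        if( last-first<=1 ) {
            if( first<last )
                ProcessItem( first );
        } else {
            int mid = first+(last-first)/2;
            tbb::parallel_invoke(
                [=]{ ProcessRange( first, mid ); },
                [=]{ ProcessRange( mid, last ); }
            );
        }
    }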

    Often domains are not obviously tree structured, but you can easily map them to trees. For example, parallel_for (in tbb/parallel_for.h) works over an iteration space, such as a sequence of integers. Template function parallel_for recursively splits that iteration space and thereby maps it onto a binary tree.


    Atomic Operations


    You can avoid mutual exclusion using atomic operations. When a thread performs an atomic operation, the other threads see it as happening instantaneously. The advantage of atomic operations is that they are relatively quick compared to locks, and do not suffer from deadlock and convoying. The disadvantage is that they only do a limited set of operations, and often these are not enough to synthesize more complicated operations efficiently. But nonetheless you should not pass up an opportunity to use an atomic operation in place of mutual exclusion. Class atomic<T> implements atomic operations with C++ style.

    A classic use of atomic operations is for thread-safe reference counting. Suppose x is a reference count of type int, and the program needs to take some action when the reference count becomes zero. In single-threaded code, you could use a plain int for x, and write --x; if(x==0) action(). But this method might fail for multithreaded code, because two threads might interleave their operations as shown in the following table, where ta and tb represent machine registers, and time progresses downwards:

    Interleaving of Machine Instructions

    Thread A                    Thread B
    ta = x
                                tb = x
    x = ta - 1
                                x = tb - 1
    if( x==0 )
                                if( x==0 )

    Though the code intended for x to be decremented twice, it ends up with only one less than its original value. Another problem arises because the test of x is separate from the decrement: if x starts out as two, and both threads decrement x before either thread evaluates the if condition, both threads would call action(). To correct this problem, you need to ensure that only one thread at a time does the decrement and ensure that the value checked by the "if" is the result of the decrement. You can do this by introducing a mutex, but it is much faster and simpler to declare x as atomic<int> and write "if(--x==0) action()". The method atomic<int>::operator-- acts atomically; no other thread can interfere.
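
    A minimal sketch of this reference-counting pattern follows; action() is the hypothetical routine from the text.

    #include "tbb/atomic.h"
    
    void action();           // hypothetical routine to run when the count reaches zero
    
    tbb::atomic<int> x;      // reference count, initialized elsewhere
    
    void RemoveReference() {
        // operator-- acts atomically, so exactly one thread sees the count become 0.
        if( --x==0 )
            action();
    }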

    atomic<T> supports atomic operations on type T, which must be an integral, enumeration, or pointer type. There are five fundamental operations supported, with additional interfaces in the form of overloaded operators for syntactic convenience. For example, ++, --, -=, and += operations on atomic<T> are all forms of the fundamental operation fetch-and-add. The following are the five fundamental operations on a variable x of type atomic<T>.

    Fundamental Operations on a Variable x of Type atomic<T>

    = x

    read the value of x

    x=

    write the value of x, and return it

    x.fetch_and_store(y)

    do x=y and return the old value of x

    x.fetch_and_add(y)

    do x+=y and return the old value of x

    x.compare_and_swap(y,z)

    if x equals z, then do x=y. In either case, return old value of x.

    Because these operations happen atomically, they can be used safely without mutual exclusion. Consider the following example:

    atomic<unsigned> counter;
    
    unsigned GetUniqueInteger() {
        return counter.fetch_and_add(1);
    }
    

    The routine GetUniqueInteger returns a different integer each time it is called, until the counter wraps around. This is true no matter how many threads call GetUniqueInteger simultaneously.

    The operation compare_and_swap is a fundamental operation to many non-blocking algorithms. A problem with mutual exclusion is that if a thread holding a lock is suspended, all other threads are blocked until the holding thread resumes. Non-blocking algorithms avoid this problem by using atomic operations instead of locking. They are generally complicated and require sophisticated analysis to verify. However, the following idiom is straightforward and worth knowing. It updates a shared variable globalx in a way that is somehow based on its old value:

    atomic<int> globalx;
    
    int UpdateX() {      // Update globalx and return its old value.
        int oldx, newx;  // local copies of the old and newly computed values
        do {
            // Read globalX
            oldx = globalx;
            // Compute new value
            newx = ...expression involving oldx....
            // Store new value if another thread has not changed globalX.
        } while( globalx.compare_and_swap(newx,oldx)!=oldx );
        return oldx;
    }
    

    In the worst case, a thread must iterate the loop until no other thread interferes. Typically, if the update takes only a few instructions, the idiom is faster than the corresponding mutual-exclusion solution.

    Caution

    If the following sequence thwarts your intent, then the update idiom is inappropriate:

    1. A thread reads a value A from globalx

    2. Other threads change globalx from A to B to A

    3. The thread in step 1 does its compare_and_swap, reading A and thus not detecting the intervening change to B.

    The problem is called the ABA problem. It is frequently a problem in designing non-blocking algorithms for linked data structures. See the Internet for more information.


    Cancellation Without An Exception


    To cancel an algorithm but not throw an exception, use the expression task::self().cancel_group_execution(). The part task::self() references the innermost Intel® TBB task on the current thread. Calling cancel_group_execution() cancels all tasks in its task_group_context, which is explained in more detail in Cancellation and Nested Parallelism. The method returns true if it actually causes cancellation, false if the task_group_context was already cancelled.

    The example below shows how to use task::self().cancel_group_execution().

    #include "tbb/tbb.h"
    #include <vector>
    #include <iostream>
    
    using namespace tbb;
    using namespace std;
    
    vector<int> Data;
    
    struct Update {
        void operator()( const blocked_range<int>& r ) const {
            for( int i=r.begin(); i!=r.end(); ++i )
                if( i<Data.size() ) {
                    ++Data[i];
                } else {
                    // Cancel related tasks.
                    if( task::self().cancel_group_execution() )
                        cout << "Index "<< i << " caused cancellation\n";
                    return;
                }
        }
    };
    
    int main() {
        Data.resize(1000);
        parallel_for( blocked_range<int>(0, 2000), Update());
        return 0;
    }