Forums https://software.intel.com/pt-br/view/forum-page-default/36968?language=en
pt-br
Intel® Modern Code Developer Community https://software.intel.com/pt-br/forums/intel-moderncode-for-parallel-architectures/topic/562073?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Intel launches the new <a href="http://software.intel.com/moderncode">Intel® Modern Code Developer Community</a>. Check out the new site.</p>
<p>The Modern Code Developer program uses multi-level parallelism as its framework, drawing on all of the parallel performance features available on modern hardware through vectorization, multi-threading, and multi-node optimizations. Explore how to deliver multi-level parallel algorithms that scale forward effectively on today’s and tomorrow’s hardware.</p>
<p>Check out the <a href="http://software.intel.com/en-us/code-modernization/library">Modern Code Library</a> for technical solutions and information.</p>
<p>This is the primary forum for all things parallel.</p>
</div></div></div>Tue, 14 Jul 2015 17:31:09 +0000 Mike P. (Intel) 562073 at https://software.intel.com
How to get FLOPS of my application with Intel SDE? https://software.intel.com/pt-br/node/758170?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Hi:</p>
<p>I want to get the FLOPS (amount of computation) of my application using SDE (Intel Software Development Emulator). I read through the guide on the page <a href="https://software.intel.com/en-us/articles/intel-software-development-emulator?page=1#BASIC,and">https://software.intel.com/en-us/articles/intel-software-development-emu...</a> and then ran sde -mix -- myapplication.exe. When I open the output file, there are too many numbers, so my question is: how do I get the total FLOPS of my application?</p>
<p>Thanks.</p>
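<p>As a rough sketch of the arithmetic involved: once the floating-point instruction counts have been picked out of the mix output by hand, the total operation count is a weighted sum of those counts. The <code>FpCount</code> struct and its fields below are hypothetical and are not SDE's output format; the sketch also assumes a fused multiply-add is counted as two operations per element. A rate (FLOP per second) additionally needs a run time measured separately.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One row of a hand-made summary of floating-point instruction counts.
// Hypothetical layout for illustration; this is NOT SDE's output format.
struct FpCount {
    int elements;            // FP elements processed per instruction (1 = scalar)
    std::uint64_t executed;  // how many times such instructions were executed
    bool fused_multiply_add; // assumed to count as a multiply plus an add per element
};

// Total floating-point operations: sum of elements * executed, doubled for FMA.
std::uint64_t total_flop(const std::vector<FpCount>& counts) {
    std::uint64_t total = 0;
    for (const FpCount& c : counts) {
        assert(c.elements > 0);
        const std::uint64_t per_instruction =
            static_cast<std::uint64_t>(c.elements) * (c.fused_multiply_add ? 2u : 1u);
        total += per_instruction * c.executed;
    }
    return total;
}

// Rate in FLOP per second, given a run time measured separately.
double flop_per_second(std::uint64_t flop, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(flop) / seconds;
}
```

<p>For example, 100 scalar operations, 50 four-wide operations, and 10 eight-wide FMAs would give 100 + 200 + 160 = 460 operations under these assumptions.</p>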
</div></div></div><div class="field field-name-field-attachments field-type-file field-label-hidden"><div class="field-items"><div class="field-item even"><table class="sticky-enabled">
<thead><tr><th>Attachment</th><th>Size</th> </tr></thead>
<tbody>
<tr class="odd"><td><span class="file"><a href="https://software.intel.com/sites/default/files/managed/2b/b9/sde-mix-out.txt" class="button-cta" type="text/plain; length=792033">Download</a><img class="file-icon" typeof="foaf:Image" src="https://software.intel.com/sites/all/themes/isn3/css/images/attachment_icon.png" alt="text/plain" title="text/plain" /> <a href="https://software.intel.com/sites/default/files/managed/2b/b9/sde-mix-out.txt" type="text/plain; length=792033">sde-mix-out.txt</a></span></td><td>773.47 KB</td> </tr>
</tbody>
</table>
</div></div></div>Tue, 27 Feb 2018 12:14:37 +0000 zhang t. 758170 at https://software.intel.com
Diagnosing unexpectedly poor scaling among IO and compute threads https://software.intel.com/pt-br/forums/intel-moderncode-for-parallel-architectures/topic/758125?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Hi, I have a C# workload built against .NET 4.6.1 using the TPL (task parallel library) on Win 10 Fall Creators Update (.NET 4.7.1). Two IO tasks run in parallel to p/invoke CreateFile and ReadFile to read the first 8 kB from each file in an array from an SSD. Two compute tasks then pick up these 8k chunks and call through a C++/CLI layer into C++ for some SIMD number crunching and stay in ring 3. The IO tasks share a C# lock statement to increment the file array index and the compute tasks use a second, independent lock to increment through the chunks. The compute tasks have some brief delay logic in case compute gets ahead of IO but, as IO is consistently faster, instrumentation shows it fires at most once at the start of processing.</p>
<p>As a baseline, I'm profiling a 6000 file case. Running the two IO tasks first, waiting for them to complete, and then running the two compute tasks gives 1.6 s for the IO (3800 file reads/s) and 1.8 s for compute (3400 file chunks/s). Since the operation is sequential, total time is 3.4 s for an overall throughput of 1800 files/s. Since the test processor is dual core and hyperthreaded (i5-4200U Haswell), one would expect that running all four tasks in parallel would complete in close to the 1.8 s limiting duration for compute. Unfortunately, this isn't what happens. Instead, IO completes in 1.7 s (a drop to 3500 files/s) and compute degrades from 1.8 to 2.6 s, a rather precipitous drop from 3400 to only 2300 files/s. While this is still a decent improvement over the 1800 files/s of sequential operation, it leaves only two threads running on four logical processors for the 900 ms after IO completes, while the compute tasks are still running. One might reasonably expect that spinning up two more compute tasks at this point would shorten this period to 450 ms, since that doubles the processor resources allocated to CPU-bound work. But that's not what happens. Instead of overall throughput rising to 2800 files/s, the compute time remains 2.6 s despite the extra processing power. Curiously, a single compute task with nothing else running also takes about 2.6 s, even though inspection of performance counters shows good load balancing between two compute tasks.</p>
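<p>The throughput figures quoted above can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers from the post; the function names are made up for illustration, and the "halved tail" model is the poster's stated expectation, not a measured result.</p>

```cpp
#include <cassert>
#include <cstdio>

// Throughput in files per second.
double files_per_second(double files, double seconds) {
    assert(seconds > 0.0);
    return files / seconds;
}

// Expected total time if the compute-only tail that remains after IO finishes
// were split across twice as many compute tasks (the expectation in the post).
double time_with_halved_tail(double io_seconds, double compute_seconds) {
    assert(compute_seconds >= io_seconds);
    return io_seconds + (compute_seconds - io_seconds) / 2.0;
}

// Prints the figures discussed in the post for the 6000 file case.
void print_report() {
    const double files = 6000.0;
    std::printf("IO alone:          %.0f files/s\n", files_per_second(files, 1.6));
    std::printf("compute alone:     %.0f files/s\n", files_per_second(files, 1.8));
    std::printf("sequential:        %.0f files/s\n", files_per_second(files, 1.6 + 1.8));
    std::printf("overlapped:        %.0f files/s\n", files_per_second(files, 2.6));
    std::printf("hoped-for overlap: %.0f files/s\n",
                files_per_second(files, time_with_halved_tail(1.7, 2.6)));
}
```

<p>The hoped-for case works out to 1.7 s + 0.45 s = 2.15 s, or about 2800 files/s, which is the figure the post compares against the observed 2300 files/s.</p>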
<p>Additionally, this is a best case. Sometimes performance drops as low as 1300 files/s. Sometimes this seems attributable to other system load but most of the time there's a drop to 1800 files/s with no other obvious load on the box. From some experiments with setting thread affinity, it appears this drop is attributable to both IO tasks landing on one core and both compute tasks on the other. The more typical case of 2300 files/s seems to be associated with each core running one IO task and one compute task.</p>
<p>I've attempted to have a look in VTune (Parallel Studio XE 2018 Update 1) but it consistently BSODs the box shortly after the target executable starts, so no information is available from it. However, bandwidth here is only about 30 MB/s, so I'd not expect any trouble with L3, and all operations have sequential stride, so 4k aliasing shouldn't be an issue. Profiling in Visual Studio 2017 indicates only the expected CPU hotspots and shows no contention over either the IO or the compute lock. Oddly, VS does indicate some shared handle contention at the C++/CLI to C++ and p/invoke sites, but the numbers are inconsistent with the observed delays, and inspection of the release build disassembly shows no critical regions at these points. So I suspect this is just VS indicating the insertion of its contention instrumentation. Also, the codebase contains another quad-thread SIMD workload which runs through the same classes but doesn't exhibit this scalability problem (it runs about twice as fast on four logical processors as on two, as expected). The difference is that this workload initiates from a single-threaded C++/CLI transition and then invokes concurrency::parallel_for at the C++ level.</p>
<p>Any suggestions as to how else to take this apart to try to figure out what's going on? Pushing the compute tasks into C++ or C++/CLI isn't really an option as, in addition to the SIMD, they need to update data structures defined in a dependent C# assembly and also make some computationally lightweight but functionally critical C# calls into managed dependencies.</p>
</div></div></div>Sun, 25 Feb 2018 20:10:41 +0000 Todd W. 758125 at https://software.intel.com
Loop optimization for non-uniform access to an array https://software.intel.com/pt-br/node/757255?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Hi,</p>
<p>I have the following kind of loop that I am looking to optimize for Intel compiler 18.0:</p>
<pre class="brush:cpp;">SomeData* sourcePtr = GetMySourceSomeData();
SomeData* saPtr = GetMySomeDataPointer();
int size = GetSizeSomeData(saPtr);
// indexArray holds a series of indices based on some business logic; they can be considered random.
int* indexArray = GetRandomIndexArray();
SomeData zero = SomeData(0);

// The loop below needs to be optimized.
// GetDecisionMaker(point) returns a boolean value based on some business logic; it can be considered random.
for (int point = 0; point &lt; size; indexArray++, saPtr++, point++)
{
    *(saPtr) = (GetDecisionMaker(point)) ? sourcePtr[*(indexArray)] : zero;
}</pre>
<p> </p>
<p>With Intel compiler 13.0 we got good performance, but with 18.0 we do not.</p>
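<p>One restructuring that is sometimes worth trying for a loop of this shape is to use a single induction variable with array indexing and to precompute the decisions into a mask, so the loop body becomes a plain masked gather. The sketch below is only an illustration under stated assumptions: <code>SomeData</code> is taken to be a trivially copyable type (here <code>float</code>), and the <code>mask</code> array is a hypothetical buffer holding the results of <code>GetDecisionMaker(point)</code>. Whether this helps with any particular compiler version would have to be measured.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the post's SomeData; assumed trivially copyable.
using SomeData = float;

// Masked gather: dst[i] = mask[i] ? src[index[i]] : zero.
// `mask` is assumed to hold the precomputed results of GetDecisionMaker(i).
void masked_gather(const SomeData* src, const int* index,
                   const std::uint8_t* mask, SomeData* dst, std::size_t size) {
    assert(size == 0 || (src && index && mask && dst));
    const SomeData zero = SomeData(0);
    for (std::size_t i = 0; i < size; ++i) {
        // One induction variable and array indexing, instead of three
        // pointers advanced in the loop header.
        dst[i] = mask[i] ? src[index[i]] : zero;
    }
}
```

<p>As in the original loop, the source element is only read when the decision is true, so out-of-range indices in masked-off positions are never dereferenced.</p>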
<p> </p>
<p>All help is welcome!</p>
<p>Thanks.</p>
</div></div></div>Thu, 15 Feb 2018 13:37:50 +0000 Abhishek S. 757255 at https://software.intel.com
Installing Parallel Studio student edition https://software.intel.com/pt-br/forums/intel-moderncode-for-parallel-architectures/topic/755442?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Hello,</p>
<p>I am trying to install Parallel Studio XE 2018 on my Linux machine. However, I get the following message:</p>
<p>"Installation program failed to establish internet connection for activation.<br />
Product activation using a serial number requires a working Internet connection <br />
to the Intel(R) Software Development Products Registration Center. If Internet<br />
connection fails and you cannot fix the problems, you need to activate your<br />
product offline." </p>
<p>My internet connection is working fine, though.</p>
<p>How do I activate my product offline? Where do I get the license file?</p>
<p>Thanks in advance,</p>
<p>-Sara</p>
</div></div></div>Sun, 04 Feb 2018 07:29:44 +0000 r, saravana 755442 at https://software.intel.com
Question on SIMD https://software.intel.com/pt-br/forums/intel-moderncode-for-parallel-architectures/topic/755194?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>I have a question that I haven't had much luck finding the answer to. What I have gathered online is that the SIMD instructions (SSE and AVX variants) allow the two 64-bit chunks in xmm0 to be ANDed with xmm1, with both chunks processed in parallel. My question is: does that also apply across the registers themselves? For instance, take A0⊕B0, A1⊕B1, A2⊕B2, and A3⊕B3. Would each of these data sets need to be run one at a time, or would they run in parallel? I'm mainly interested in the logic and move instructions. I'm learning how to deal with large data sets, so the ability to process more data faster is always helpful. If any of the instructions need assembly, that's not a problem. The program will be in assembly.</p>
<p> </p>
<p>EDIT: Never mind I think I found my answer. Would delete post but I am new here and can't figure out how.</p>
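<p>For readers with the same question, here is a minimal sketch of the within-register case using SSE2 intrinsics rather than hand-written assembly: a single 128-bit XOR processes both 64-bit lanes at once. The function name is made up for illustration, and the code assumes an x86 or x86-64 target. Operations on separate register pairs are separate, independent instructions; whether a core overlaps them depends on its microarchitecture.</p>

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics; x86 / x86-64 only

// XORs two pairs of 64-bit values with one 128-bit operation:
// out[0] = a[0] ^ b[0] and out[1] = a[1] ^ b[1].
void xor_pairs(const std::uint64_t a[2], const std::uint64_t b[2],
               std::uint64_t out[2]) {
    // Unaligned loads, so the arrays need no special alignment.
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // Bitwise XOR across the whole 128-bit register, i.e. both lanes together.
    const __m128i vr = _mm_xor_si128(va, vb);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vr);
}
```

<p>Replacing <code>_mm_xor_si128</code> with <code>_mm_and_si128</code> gives the AND variant mentioned in the post.</p>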
</div></div></div>Mon, 29 Jan 2018 19:02:00 +0000 Whitteker, Don 755194 at https://software.intel.com
OpenMP Thread Affinity Implementation on Xeon Phi https://software.intel.com/pt-br/node/755178?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Hi All,</p>
<p>I am using a Xeon Phi with Intel Parallel Studio, which also provides the OpenMP libraries. I am wondering what the best way is to understand how the "balanced" affinity is implemented in the OpenMP *code*, and which source I should refer to for that.</p>
<p>If anyone can answer the following, it will help:</p>
<p>1) Does Intel Parallel Studio use the Intel OpenMP source code listed here: <a href="https://www.openmprtl.org/">https://www.openmprtl.org/</a>?<br />
2) What is the best way to understand how balanced affinity is implemented in Intel's OpenMP source code?<br />
3) What approach should one take to implement a custom thread affinity scheme in OpenMP and get it working with Intel Caffe?</p>
<p> Thanks.</p>
</div></div></div>Mon, 29 Jan 2018 04:46:52 +0000 Chetan Arvind Patil 755178 at https://software.intel.com
How can I allocate memory within a socket (NUMA node) manually in a multi-socket system? https://software.intel.com/pt-br/forums/intel-moderncode-for-parallel-architectures/topic/755067?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>I have a quad-socket system running RHEL, and I'm testing a job that takes about a day to run.</p>
<p>It's important to bind the process within a socket, because crossing NUMA nodes causes runtime variation.</p>
<p>The problem is that when the job has used about 50% of node 0's memory, the OS starts giving it memory from node 3.</p>
<p>I tried "taskset -c", but the result is the same.</p>
<p>Can I make the job use all of node 0's memory before it uses another node's memory?</p>
</div></div></div>Fri, 26 Jan 2018 06:50:40 +0000 Soh, Mingyun 755067 at https://software.intel.com
GEMM and First Touch Memory Allocation https://software.intel.com/pt-br/forums/intel-moderncode-for-parallel-architectures/topic/754817?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>I am doing some experiments with various implementations that compute the GEMM operation C &lt;- alpha*AB + beta*C. One question that came up is whether memory locality has any impact on performance on a Sandy Bridge Xeon node. The following is one implementation of GEMM (I have many others, including MKL's SGEMM(...)):</p>
<pre class="brush:fortran;">call RANDOM_NUMBER(A)
call RANDOM_NUMBER(B)
call RANDOM_NUMBER(C)
!$OMP PARALLEL DO SHARED(A,B,C) PRIVATE(i,j,l,sum)
do j = 1, n
   do i = 1, m
      sum = 0
      do l = 1, k
         sum = sum + A(i,l)*B(l,j)
      enddo
      C(i,j) = alpha*sum + beta*C(i,j)
   enddo
enddo
!$OMP END PARALLEL DO</pre><p>This code tries to use stride-1 memory access where possible, but that clearly isn't possible for matrix A. Since random_number(:) is called outside a parallel region, all three matrices are first touched by thread 0 and so are placed in the memory local to it. This should make memory access quite expensive, especially for threads on a different socket (i.e. threads 8-15 when using 16 threads).</p>
<p>Some observations about how the loop works:</p>
<p>1) Matrix A is needed in full by every thread, as it has no dependence on loop j (i.e. the loop being distributed across the threads)</p>
<p>2) Only columns of B and C are needed on each thread</p>
<p>A potential solution that uses first touch with the static scheduling option could be this:</p>
<pre class="brush:fortran;">!$OMP PARALLEL SHARED(A,B,C,m,n) PRIVATE(i,j)
! Matrix A is needed on every thread, so we spread out rows of A evenly to all threads
!$OMP DO
do i = 1, m
   call RANDOM_NUMBER( A(i,:) )
end do
!$OMP END DO
! Only columns of B and C are needed on each thread
!$OMP DO
do j = 1, n
   call RANDOM_NUMBER( B(:,j) )
   call RANDOM_NUMBER( C(:,j) )
end do
!$OMP END DO
!$OMP END PARALLEL</pre><p>... Continue algorithm ...</p>
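<p>For comparison, the same first-touch idea can be sketched in C++ with OpenMP. This is an illustration and not a translation of the poster's full program: it assumes square, column-major matrices of <code>double</code>, a Linux-style first-touch page placement policy, and that initialization and compute use the same static schedule so each thread touches the columns it later works on. The function names are made up for the sketch. Without an OpenMP compiler flag the pragmas are ignored and the code runs serially.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

using Matrix = std::unique_ptr<double[]>;

// Allocates an n-by-n column-major matrix without initializing it, then fills
// it column by column in a statically scheduled parallel loop, so each column
// is first written by the thread that owns that column in gemm() below.
Matrix first_touch_columns(std::ptrdiff_t n, double value) {
    assert(n > 0);
    Matrix m(new double[static_cast<std::size_t>(n) * static_cast<std::size_t>(n)]);
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t j = 0; j < n; ++j)
        for (std::ptrdiff_t i = 0; i < n; ++i)
            m[j * n + i] = value;
    return m;
}

// C <- alpha*A*B + beta*C for n-by-n column-major matrices, distributing
// columns of B and C over threads with the same static schedule as above.
void gemm(std::ptrdiff_t n, double alpha, const double* A, const double* B,
          double beta, double* C) {
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t j = 0; j < n; ++j) {
        for (std::ptrdiff_t i = 0; i < n; ++i) {
            double sum = 0.0;
            for (std::ptrdiff_t l = 0; l < n; ++l)
                sum += A[l * n + i] * B[j * n + l];
            C[j * n + i] = alpha * sum + beta * C[j * n + i];
        }
    }
}
```

<p>Page placement happens at page granularity, so very small matrices will not show any effect; the sketch only demonstrates matching the initialization schedule to the compute schedule.</p>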
<p>Giving the threads near-perfect local access to two of the three matrices had hardly any impact on performance. I observed a 1-2% difference in execution times, which isn't conclusive evidence that the first-touch allocation is doing anything. Even when we tried the same thing with a pre-transposed matrix A, to get stride-1 access when computing the matrix multiply, like:</p>
<p>Atr = TRANSPOSE(A) ! now use Atr(l, i) instead of A(i,l) in the do loop</p>
<p>we got similar percentage differences in run times. The experiments were run on square matrices A, B, and C of size 5000x5000.</p>
<p>Does anyone have any insight as to what can be done?</p>
</div></div></div>Mon, 22 Jan 2018 18:48:48 +0000 Groth, Brandon 754817 at https://software.intel.com
Argument mismatch for ierr, size and rank in MPI call https://software.intel.com/pt-br/forums/intel-moderncode-for-parallel-architectures/topic/753900?language=en
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Hi,</p>
<p>When I moved my code to a new computer (and after changing from integer to integer*4 for ierr, size and rank) I get compile warnings (attached) and runtime errors (attached) for the data type of ierr, size and rank.</p>
<p>I have reduced the source code to just a few lines (attached).</p>
<p>Best regards</p>
<p>Anders S</p>
</div></div></div><div class="field field-name-field-attachments field-type-file field-label-hidden"><div class="field-items"><div class="field-item even"><table class="sticky-enabled">
<thead><tr><th>Attachment</th><th>Size</th> </tr></thead>
<tbody>
<tr class="odd"><td><span class="file"><a href="https://software.intel.com/sites/default/files/managed/ba/16/source%20code.PNG" class="button-cta" type="image/png; length=6858">Download</a><img class="file-icon" typeof="foaf:Image" src="https://software.intel.com/sites/all/themes/isn3/css/images/attachment_icon.png" alt="image/png" title="image/png" /> <a href="https://software.intel.com/sites/default/files/managed/ba/16/source%20code.PNG" type="image/png; length=6858">source code.PNG</a></span></td><td>6.7 KB</td> </tr>
<tr class="even"><td><span class="file"><a href="https://software.intel.com/sites/default/files/managed/64/56/compile%20error.PNG" class="button-cta" type="image/png; length=78026">Download</a><img class="file-icon" typeof="foaf:Image" src="https://software.intel.com/sites/all/themes/isn3/css/images/attachment_icon.png" alt="image/png" title="image/png" /> <a href="https://software.intel.com/sites/default/files/managed/64/56/compile%20error.PNG" type="image/png; length=78026">compile error.PNG</a></span></td><td>76.2 KB</td> </tr>
<tr class="odd"><td><span class="file"><a href="https://software.intel.com/sites/default/files/managed/a8/ca/Runtime%20error.PNG" class="button-cta" type="image/png; length=19797">Download</a><img class="file-icon" typeof="foaf:Image" src="https://software.intel.com/sites/all/themes/isn3/css/images/attachment_icon.png" alt="image/png" title="image/png" /> <a href="https://software.intel.com/sites/default/files/managed/a8/ca/Runtime%20error.PNG" type="image/png; length=19797">Runtime error.PNG</a></span></td><td>19.33 KB</td> </tr>
</tbody>
</table>
</div></div></div>Thu, 21 Dec 2017 17:18:42 +0000 Anders S. 753900 at https://software.intel.com