
AM5728 NEON acceleration

I am using the AM5728 platform. cpuinfo shows that NEON acceleration is supported, and I would now like to use NEON to optimize a program. This is my first time using it; is there any reference material? Thanks.

Shine:

Please refer to the NEON material at the site below.
processors.wiki.ti.com/…/Cortex-A8

user5875077:

Reply to Shine:

Why can't I access it? Do I need a VPN?

Shine:

Reply to user5875077:

Try this link.
processors.wiki.ti.com/…/Cortex-A8

Shine:

Reply to Shine:

Cortex-A8
Content is no longer maintained and is being kept for reference only!

Contents
1 ARM Cortex-A8 Overview & Introduction
1.1 Architecture Overview
1.2 Available Tools
1.3 What is Neon?
1.3.1 What are the advantages of Neon
1.3.2 How to develop code for Neon
1.3.3 What does a Neon assembly instruction look like
1.3.4 How to enable NEON
1.3.5 Neon Auto Vectorization Compiler directives and Example
1.3.6 Neon Intrinsics
1.3.7 Assembly
1.4 Compiler Comparison
1.5 What is VFP?
1.5.1 How to compile and run VFP code
1.5.2 What is the relationship between Neon and VFP?
1.5.3 Neon and VFP both support floating point, which should I use?
2 Useful documentation
3 Useful links
4 ARM Cortex-A8 Terminology
ARM Cortex-A8 Overview & Introduction
Cortex-A8 Features
Feature Comparison: ARM 926, 1136 and Cortex-A8
Architecture Overview
Cortex-A8 Architecture
Cortex-A8 Neon Architecture
The Cortex-A8 Microprocessor – 2 Page White Paper
Available Tools
There are a variety of tools:

Texas Instruments' CCS (good for low-level ARM and DSP debug)
ARM SW Development tools (good for low-level ARM debug)
Microsoft Visual Studio 2005 + Platform Builder plugin for the IDE debugger (WinCE application debug)
Lauterbach (good for low-level ARM and DSP debug, experienced with previous OMAP/AM products)
Green Hills' MULTI (good for low-level ARM and DSP debug)
Mentor Graphics Sourcery tools (good for Linux application debug)
MontaVista's DevRocket (good for Linux application debug)
If you need a tool that understands Linux, CodeSourcery and MontaVista are the way to go; at present the CodeSourcery tool-chain has better support for the Cortex-A8 found in OMAP35x / AM35x devices.

What is Neon?
According to ARM, the Neon block of the Cortex-A8 core includes both the Neon and VFP accelerators. Neon is a SIMD (Single Instruction Multiple Data) accelerator integrated as part of the ARM Cortex-A8. What does SIMD mean? It means that during the execution of one instruction the same operation occurs on up to 16 data sets in parallel; the term is synonymous with vector processor. Because of this internal parallelism, you can get more MIPS or FLOPS out of Neon than out of a standard SISD processor running at the same clock rate. Many Neon benchmarks are quoted as "ARM takes N instructions while Neon takes fewer than N instructions"; this shows how much parallelism can be achieved for that benchmark, and reducing the instruction count reduces the number of clocks used to perform the same task.

A simple rule of thumb for how much Neon will speed up a specific loop is to look at the data size of the operation. Since the largest Neon register is 128 bits, if you are performing an operation on 8-bit values you can perform 16 operations simultaneously; at the other end of the spectrum, with 32-bit data you can perform 4 operations simultaneously. However, remember that there are always other considerations that affect execution speed, such as memory throughput and loop overhead. Neon instructions are mainly numerical, load/store, and some logical operations. Neon operations execute in the NEON pipeline, while other instructions such as branches execute in the main ARM core pipeline. (See the reference to the Cortex-A8 Architecture above for a description of the ARM Cortex-A8 and NEON pipelines.)
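As a quick illustration of that rule of thumb, the NEON vector types from arm_neon.h (used later on this page) each fill one 128-bit Q register, so the lane count is simply 128 divided by the element size. The declarations below are only a sketch added here to show the lane counts; they are not part of the original wiki page:

#include <arm_neon.h>

int8x16_t   v8;   /* 16 lanes of 8-bit integers  -> up to 16 operations per instruction */
int16x8_t   v16;  /*  8 lanes of 16-bit integers -> up to 8 operations per instruction  */
int32x4_t   v32;  /*  4 lanes of 32-bit integers -> up to 4 operations per instruction  */
float32x4_t vf;   /*  4 lanes of 32-bit floats   -> up to 4 operations per instruction  */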

What are the advantages of Neon
Aligned and unaligned data access allows for efficient vectorization of SIMD operations.
Support for both integer and floating point operations ensures adaptability to a broad range of applications, from compression decoding to 3D graphics.
Tight coupling to the ARM core provides a single instruction stream and a unified view of memory, presenting a single development platform target with a simpler tool flow.
The large Neon register file with its multiple views enables efficient handling of data and minimizes access to memory, enhancing data throughput performance.
Neon Advantages details

How to develop code for Neon
Unfortunately, in most cases you cannot simply compile general C code and get a huge speed-up from Neon. But if you truly want to utilize the power of Neon there are some basic steps you can follow. You need some basic understanding of what it means to vectorize the code, you need to know how to enable Neon in the Cortex-A8, and you also need to have the L2 cache enabled to get appreciable speed increases.

Compiler options – you can direct the compiler to auto-vectorize: the compiler generates Neon code. See the Neon auto vectorization example.
Neon intrinsics – compilable function-like macros that give low-level access to Neon operations
Assembly code – write your own assembly or link highly optimized libraries.
For intrinsics, ARM has created a NEON support library called NE10 to make your jump into NEON easier: blogs.arm.com/…/703-ne10-a-new-open-source-library-to-accelerate-your-applications-with-neon

What does a Neon assembly instruction look like
A Neon instruction looks like the following:

VMUL.I16 q0,q0,q1

VMUL – multiply assembly instruction
.I16 – Indicates this instruction operates on 16 bit integers. This would be a "short int" in C code.
q0,q1 – Neon registers. A 'q' register is 128 bits wide and will hold 8 short ints.
This Neon instruction would simultaneously multiply the 8 operands in q0 with the 8 operands in q1 and store the 8 results in q0.
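For comparison, the same operation can be written as a NEON intrinsic (a small sketch added here for illustration; the function name is made up): vmulq_s16 multiplies the eight signed 16-bit lanes of two Q-register values, which is exactly what VMUL.I16 does.

#include <arm_neon.h>

int16x8_t mul8_shorts(int16x8_t a, int16x8_t b)
{
    return vmulq_s16(a, b);   /* compiles down to a single VMUL.I16 qX,qY,qZ */
}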

How to enable NEON
The NEON/VFP unit comes up disabled on power-on reset. Enabling Neon requires some coprocessor commands. Below is the assembly code required to enable NEON/VFP, in gcc-style syntax (the ARM code tools use a slightly different syntax).

MRC p15, #0, r1, c1, c0, #2   ; r1 = Access Control Register
ORR r1, r1, #(0xf << 20)      ; enable full access for p10,11
MCR p15, #0, r1, c1, c0, #2   ; Access Control Register = r1
MOV r1, #0
MCR p15, #0, r1, c7, c5, #4   ; flush prefetch buffer because of FMXR below
                              ; and CP 10 & 11 were only just enabled
; Enable VFP itself
MOV r0, #0x40000000
FMXR FPEXC, r0                ; FPEXC = r0
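On a Linux system such as the AM5728 in the original question, the kernel performs this enabling at boot, so application code normally only needs to confirm that the feature is present. A minimal userspace sketch (not part of the wiki page) that checks the same "neon" flag the poster already saw in cpuinfo:

#include <stdio.h>
#include <string.h>

/* Returns 1 if /proc/cpuinfo reports the "neon" feature, 0 otherwise. */
int cpu_has_neon(void)
{
    char line[512];
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Features", 8) == 0 && strstr(line, " neon")) {
            fclose(f);
            return 1;
        }
    }
    fclose(f);
    return 0;
}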
Neon Auto Vectorization Compiler directives and Example
Compiler tools          Autovectorization compiler directives
Code Composer Studio    "-o3 -mv7a8 --neon -mf"
CodeSourcery (gcc)      "-march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp"
Realview                "--cpu=Cortex-A8 -O3 -Otime --vectorize"

Here is a simple example of autovectorizing a small C function using the Realview compiler

File: itest.c

void NeonTest(short int * __restrict a, short int * __restrict b, short int * __restrict z)
{
    int i;
    for (i = 0; i < 200; i++) {
        z[i] = a[i] * b[i];
    }
}

ARM only code

generated by ARM/Thumb C/C++ Compiler, RVCT3.1 [Build 616]

command line: armcc -c --asm --interleave --cpu=Cortex-A8 itest.c

loop iterations: 200

000000  e92d4030  PUSH     {r4,r5,lr}
000004  e3a03000  MOV      r3,#0
|L1.8|
000008  e0804083  ADD      r4,r0,r3,LSL #1
00000c  e0815083  ADD      r5,r1,r3,LSL #1
000010  e1d440b0  LDRH     r4,[r4,#0]
000014  e1d550b0  LDRH     r5,[r5,#0]
000018  e1640584  SMULBB   r4,r4,r5
00001c  e0825083  ADD      r5,r2,r3,LSL #1
000020  e2833001  ADD      r3,r3,#1
000024  e35300c8  CMP      r3,#0xc8
000028  e1c540b0  STRH     r4,[r5,#0]
00002c  bafffff5  BLT      |L1.8|
000030  e8bd8030  POP      {r4,r5,pc}

ARM + Neon code

generated by ARM/Thumb NEON C/C++ Compiler with Crescent Bay VAST 10.7z8 ARM NEON, RVCT3.1 [Build 616]

command line: armcc -c --asm --interleave --cpu=Cortex-A8 -O3 -Otime --vectorize itest.c

loop iterations: 25 (each VMUL.I16 processes 8 short ints, so 200 elements take 200/8 = 25 iterations)

Neon instructions

000000  e3a03019  MOV      r3,#0x19
|L1.4|
000004  f4200a4d  VLD1.16  {d0,d1},[r0]!
000008  e2533001  SUBS     r3,r3,#1
00000c  f4212a4d  VLD1.16  {d2,d3},[r1]!
000010  f2100952  VMUL.I16 q0,q0,q1
000014  f4020a4d  VST1.16  {d0,d1},[r2]!
000018  1afffff9  BNE      |L1.4|
00001c  e12fff1e  BX       lr
Neon Intrinsics
You may find that the autovectorizing compiler does not always work well for more complex functions. Intrinsics are a combination of assembly code and C code. They give you direct control over the Neon SIMD functionality, similar to coding in assembly, and they also give you C-level compiler errors to warn you when your input and output types do not match consistently.

Here is a small function in C which adds together the 200 corresponding elements from arrays x and y and stores each result in array z:

void NeonTest(int * x, int * y, int * z)
{
    int i;
    for (i = 0; i < 200; i++) {
        z[i] = x[i] + y[i];
    }
}
Here is the equivalent function using intrinsics:

#include "arm_neon.h"
void intrinsics(uint32_t *x, uint32_t *y, uint32_t *z)
{
    int i;
    uint32x4_t x4, y4;   // These 128 bit registers will contain 4 values from the x array and 4 values from the y array
    uint32x4_t z4;       // This 128 bit register will contain the 4 results from the add intrinsic
    uint32_t *ptra = x;  // pointer to the x array data
    uint32_t *ptrb = y;  // pointer to the y array data
    uint32_t *ptrz = z;  // pointer to the z array data
    for (i = 0; i < 200/4; i++) {
        x4 = vld1q_u32(ptra);    // intrinsic to load x4 with 4 values from x
        y4 = vld1q_u32(ptrb);    // intrinsic to load y4
        z4 = vaddq_u32(x4, y4);  // intrinsic to add z4 = x4 + y4
        vst1q_u32(ptrz, z4);     // store the 4 results to z
        ptra += 4;               // increment pointers
        ptrb += 4;
        ptrz += 4;
    }
}
Here is the output of the intrinsic function compiled with GCC (CodeSourcery Sourcery G++ Lite 2007q3-51) 4.2.1:

0000  323E81E2  add       r3, r1, #800
.L2:
0004  8F4A21F4  vld1.32   {d4-d5}, [r1]
0008  101081E2  add       r1, r1, #16
000c  8F6A20F4  vld1.32   {d6-d7}, [r0]
0010  030051E1  cmp       r1, r3
0014  446826F2  vadd.i32  q3, q3, q2
0018  8F6A02F4  vst1.32   {d6-d7}, [r2]
001c  100080E2  add       r0, r0, #16
0020  102082E2  add       r2, r2, #16
0024  F6FFFF1A  bne       .L2
0028  1EFF2FE1  bx        lr
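Note that the intrinsic example above works because the element count (200) happens to be a multiple of 4. For an arbitrary length, a common pattern is to process 4 elements at a time and finish the remainder with a scalar loop; the sketch below (the function name is illustrative, not from the wiki) shows one way to do it:

#include <arm_neon.h>

void add_u32(const uint32_t *x, const uint32_t *y, uint32_t *z, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {                 /* vectorized body: 4 lanes per iteration */
        uint32x4_t x4 = vld1q_u32(x + i);
        uint32x4_t y4 = vld1q_u32(y + i);
        vst1q_u32(z + i, vaddq_u32(x4, y4));
    }
    for (; i < n; i++)                           /* scalar tail for the last 0-3 elements */
        z[i] = x[i] + y[i];
}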
Assembly
Coding in assembly is a last resort. If autovectorization and intrinsics are not getting the desired results, then hand coding in assembly can be the way to maximize a function's performance. Writing assembly that improves on compiled code is a skill that can require a considerable learning curve.
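Hand-written NEON does not have to live in a separate .s file; with gcc it can also be embedded as extended inline assembly. The fragment below is only a sketch (assuming gcc with -mfpu=neon; the function name is made up) that mirrors the earlier VMUL.I16 example on a single block of 8 short ints:

void mul8_shorts_asm(const short *a, const short *b, short *z)
{
    __asm__ volatile(
        "vld1.16 {d0,d1}, [%0] \n\t"   /* load 8 short ints from a into q0 */
        "vld1.16 {d2,d3}, [%1] \n\t"   /* load 8 short ints from b into q1 */
        "vmul.i16 q0, q0, q1   \n\t"   /* 8 parallel 16-bit multiplies     */
        "vst1.16 {d0,d1}, [%2] \n\t"   /* store the 8 results to z         */
        :
        : "r"(a), "r"(b), "r"(z)
        : "d0", "d1", "d2", "d3", "memory");
}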

Compiler Comparison
Compiler                Autovectorization   Intrinsics   Assembly
Code Composer Studio    Yes                 No           Yes
CodeSourcery (gcc)      Yes                 Yes          Yes
Realview                Yes                 Yes          Yes

Note: For Code Composer you need version 4.6.x or greater of the TMS470 Compiler Tools

What is VFP?
VFP is a floating point hardware accelerator. It is not a parallel architecture like Neon; basically, it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating point calculations. If a processor like the ARM does not have floating point hardware, it has to rely on software math libraries, which can slow floating point calculations down prohibitively. The VFP supports both single and double precision floating point calculations compliant with IEEE 754. Further, the VFP is not fully pipelined like Neon, so it will not have equivalent performance to Neon.

How to compile and run VFP code
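(The wiki left this section empty. Going by the comparison commands further down this page, the only difference from a NEON build is selecting the VFP as the FPU, for example with the CodeSourcery gcc used below; the source file name here is just illustrative.)

arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp -c itest.c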
What is the relationship between Neon and VFP?
Neon and VFP share the same large register file inside of the Cortex-A8. These registers are separate from the ARM core registers. The Neon/VFP register file is 256 bytes as shown in the diagram.

Neon/VFP register File
The Neon Register file has a dual view:

32 x 64-bit registers (the Dx registers)
16 x 128-bit registers (the Qx registers)
The VFP Register file also has a dual view:

32 x 64-bit registers (the Dx registers)
32 x 32-bit registers (the Sx registers; only the lower half of the register file may be viewed as 32-bit registers)
From the Neon point of view: register Q0 may be accessed as either Q0 or D0:D1

From the VFP point of view: register D0 may be accessed as either D0 or S0:S1
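The dual view is visible directly from C through the intrinsics as well. The short sketch below (added for illustration, not wiki content) splits a 128-bit Q value into its two 64-bit D halves and reassembles it:

#include <arm_neon.h>

void dual_view_demo(void)
{
    uint32x4_t q  = vdupq_n_u32(7);        /* one Q register: 4 x 32-bit lanes        */
    uint32x2_t lo = vget_low_u32(q);       /* its lower D half (e.g. D0 when q is Q0) */
    uint32x2_t hi = vget_high_u32(q);      /* its upper D half (e.g. D1 when q is Q0) */
    uint32x4_t q2 = vcombine_u32(lo, hi);  /* rebuild the Q view from the D:D pair    */
    (void)q2;                              /* silence the unused-variable warning     */
}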

There are 2 paths or pipelines through Neon:

Integer and fixed point (supports 8 bit, 16 bit, 32 bit integers)
Single precision floating point (supports 32 bit floating point)
VFP has a single path:

Single or double precision floating point (supports 32 bit and 64 bit floating point)
(Note that Neon does not support double precision floating point operations)

(Note that Neon and VFP both support single precision floating point operations)
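A practical consequence of the notes above (shown here as an added sketch, not wiki content): only a float version of a loop can be vectorized onto NEON, while the same loop written with double is forced onto the VFP's double-precision path.

/* Can be auto-vectorized with -mfpu=neon: 32-bit floats map onto NEON's 4-lane FP path */
void mul_float(const float * __restrict a, const float * __restrict b,
               float * __restrict z, int n)
{
    int i;
    for (i = 0; i < n; i++)
        z[i] = a[i] * b[i];
}

/* Cannot use NEON on ARMv7: 64-bit doubles are handled only by the VFP */
void mul_double(const double * __restrict a, const double * __restrict b,
                double * __restrict z, int n)
{
    int i;
    for (i = 0; i < n; i++)
        z[i] = a[i] * b[i];
}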

Neon and VFP both support floating point, which should I use?
The VFPv3 is fully compliant with IEEE 754
Neon is not fully compliant with IEEE 754, so it is mainly targeted for multimedia applications
Here is an example showing how Neon pipelining will outperform VFP:

Taking the same C function from earlier, but using floating point types instead:
void NeonTest(float * __restrict a, float * __restrict b, float * __restrict z)
{
    int i;
    for (i = 0; i < 200; i++) {
        z[i] = a[i] * b[i];
    }
}
Compile the above code using CodeSourcery GCC (Sourcery G++ Lite 2007q3-51) 4.2.1.

Compile the above function for both Neon and VFP and compare results:

arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -ftree-vectorize -mfloat-abi=softfp
Running on OMAP3EVM under Linux with a Cortex-A8 clock speed of 600MHz

VFP/NEON   Time to execute this function 500,000 times
VFP        7.36 seconds
Neon       0.94 seconds
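The commands and timings above target the Cortex-A8 on an OMAP3 EVM. For the AM5728 in the original question, whose ARM cores are Cortex-A15 with NEON (VFPv4), roughly equivalent autovectorization flags with a current gcc would look like the line below (a hedged suggestion based on the flags above; the toolchain prefix depends on your SDK and no timings were taken here):

arm-linux-gnueabihf-gcc -O3 -march=armv7-a -mtune=cortex-a15 -mfpu=neon-vfpv4 -mfloat-abi=hard -ftree-vectorize -ffast-math -c itest.c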
Useful documentation
The TRM is a large document, but it contains good information to answer many questions. You can get the TRM on the ARM website: ARM's Main Website. However, you need to know which revision of the Cortex-A8 you have; you can find out how to read that information here: How to Find the Cortex-A8 Revision of your OMAP35x
There are many useful App Notes by ARM
App Note 133 – Using VFP with RVDS
App Note 178 – Building Linux Applications Using RVDS 3.1 and the GNU Tools and Libraries
App Note 150 – Building Linux Applications Using RVDS 3.0 and the GNU Tools and Libraries
Other valuable ARM documents
ARM Information Center
Realview Compilation Tools: Assembler Guide
Useful links
TI Open Source Projects – Cutting edge information

ARM Cortex-A8 Terminology
SIMD – A processor capable of Single Instruction Multiple Data. For example, during a single operation such as an "add", up to 16 sets of data can be added in parallel.
Superscalar – An architecture which employs instruction level parallelism. The Cortex-A8 has dual in-order instruction issue.
Vector Processor – Synonymous with SIMD processor

yongqing wang:

NEON tutorials are basically generic; writing the assembly yourself gives the best results. There is also an open-source C library: projectne10.github.io/…/

yongqing wang:

Reply to yongqing wang:

Take a look at this introduction: developer.arm.com/…/neon
