Blitter Timing

SCPCD · August 12, 2007

The goal of this topic is to give information about the blitter's access timings and to learn more about how the blitter operates.

First of all, the example code:

    move.l        #PITCH1|PIXEL32|WID128|XADDPHR,d0
    moveq        #0,d1

    move.l        d0,A2_FLAGS
    move.l        #source,A2_BASE
    move.l        d1,A2_PIXEL
    move.l        d1,A2_STEP
    
    move.l        d0,A1_FLAGS
    move.l        #destination,A1_BASE
    move.l        d1,A1_PIXEL
    move.l        d1,A1_FPIXEL
    move.l        d1,A1_STEP
    move.l        d1,A1_FSTEP
    move.l        d1,A1_CLIP
    move.l        d1,A1_INC
    move.l        d1,A1_FINC
    
    move.l        #$00010400,B_COUNT
    move.l        #SRCEN|LFU_REPLACE,B_CMD

This configures the blitter for 32 bits per pixel and transfers in phrase mode.

The source and destination differ for each of the cases described below.

We blit: $0001 * $0400 * 4 (because 32-bit pixels are selected) = 4096 bytes.

SRCEN enables the source read, and LFU_REPLACE selects a simple data copy.

All other blitter registers are unused and are initialised to zero here.
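The byte count above can be checked with a small Python sketch. It assumes that B_COUNT holds the outer (line) count in its high word and the inner (pixel) count in its low word; `blit_bytes` is a hypothetical helper name used only for this illustration.

```python
# Sketch: decode a B_COUNT value and compute how many bytes the blit moves.
# Assumption: high word = outer loop count, low word = inner loop count.
def blit_bytes(b_count: int, bits_per_pixel: int) -> int:
    outer = (b_count >> 16) & 0xFFFF   # number of lines
    inner = b_count & 0xFFFF           # pixels per line
    return outer * inner * bits_per_pixel // 8

# The example above: $00010400 with PIXEL32
total = blit_bytes(0x00010400, 32)
print(total)
```

With PIXEL16 and the same B_COUNT, the same helper gives half that amount.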

GPU->DRAM transfer in phrase mode:

source : $F03000 (G_RAM)

destination : somewhere in DRAM (phrase aligned)

Result : 11 cycles per phrase -> 4096*11/8 = 5632 cycles for 4K

DRAM->GPU transfer in phrase mode:

source : somewhere in DRAM (phrase aligned)

destination : $F03000 (G_RAM)

Result : 7 cycles per phrase -> 4096*7/8 = 3584 cycles for 4K

Fast DRAM->GPU transfer in phrase mode:

source : somewhere in DRAM (phrase aligned)

destination : $F03000+$8000 (G_RAM+$8000)

Result : 5 cycles per phrase -> 4096*5/8 = 2560 cycles for 4K
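As a cross-check of the three results, here is a minimal Python sketch that recomputes the totals for a 4K blit from the measured cycles per phrase. The dictionary keys and the `blit_cycles` helper are just labels chosen for this illustration.

```python
PHRASE_BYTES = 8  # one phrase = 64 bits

def blit_cycles(total_bytes: int, cycles_per_phrase: int) -> int:
    """Total cycles for a phrase-mode blit of total_bytes."""
    return total_bytes // PHRASE_BYTES * cycles_per_phrase

# Measured cycles per phrase, taken from the results above
measured = {
    "GPU->DRAM": 11,
    "DRAM->GPU": 7,
    "DRAM->GPU+$8000": 5,
}

cycles_4k = {name: blit_cycles(4096, c) for name, c in measured.items()}
for name, cycles in cycles_4k.items():
    print(name, cycles)
```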

-----------------------------------------------

More information to come.

If you have a specific timing you would like measured, I can help.

SebRmv · August 13, 2007

interesting! and what about the pixel mode?

in phrase mode, is it the same result if using 8 bpp, 16 bpp or 32 bpp?

SCPCD · August 13, 2007

I will try soon

SCPCD · August 15, 2007

interesting! and what about the pixel mode?

In pixel mode the cycle time per access is the same, but each access transfers one pixel instead of one phrase. So to move one phrase:

- in PIXEL1 there are 64 DRAM reads

- in PIXEL2 there are 32 DRAM reads

- in PIXEL4 there are 16 DRAM reads

- in PIXEL8 there are 8 DRAM reads

- in PIXEL16 there are 4 DRAM reads

- in PIXEL32 there are 2 DRAM reads

The number of cycles for each read/write is exactly the same as before, depending on the source and destination.

So for PIXEL16, pixel mode is 4 times slower than phrase mode.
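The list above is simply 64 bits divided by the pixel depth. A short Python sketch, assuming (as stated above) that one pixel-mode access costs the same cycles as one phrase-mode access; the function names are hypothetical:

```python
PHRASE_BITS = 64

def accesses_per_phrase(bits_per_pixel: int) -> int:
    """Number of pixel-mode accesses needed to move one phrase."""
    return PHRASE_BITS // bits_per_pixel

def pixel_mode_cycles_per_phrase(bits_per_pixel: int, cycles_per_access: int) -> int:
    """Cycles to move one phrase in pixel mode, given the per-access cost."""
    return accesses_per_phrase(bits_per_pixel) * cycles_per_access

for bpp in (1, 2, 4, 8, 16, 32):
    print(f"PIXEL{bpp}: {accesses_per_phrase(bpp)} accesses per phrase")

# Example: DRAM->GPU at 7 cycles per access with PIXEL16
print(pixel_mode_cycles_per_phrase(16, 7))
```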

in phrase mode, is it the same result if using 8 bpp, 16 bpp or 32 bpp?

Yes, it's exactly the same result.

(I haven't verified that the data is correct for modes below 32 bpp, from and to the GPU.)

Symmetry of TNG · August 18, 2007

Nice!... =)

So.. you are saying that it's slower to copy data FROM GPU mem to DRAM (5632) than it is to copy TO GPU mem (3584)?

Any ideas why? I don't really understand that from a logical point of view. Do the DRAM timings differ for reads vs writes? SRAM should have the same read & write time, right? How about DRAM?

Or is it an OP stall, a page miss on writes to DRAM, or what? Why is it slower to copy TO DRAM than it is to copy from it?

What is the GPU doing in your example code? I'm just curious because I have always wondered what happens when you start a GPU->DRAM blit and continue running GPU code afterwards. I mean, logically something would have to halt to give the blitter access to GPU mem, right? I.e. a GPU->DRAM blit would halt GPU execution and parallelism would be lost, or?

Can this be verified with your LA? I.e. faster blits from GPU mem if the GPU is turned off (or perhaps it's just a stall issue, and once the blit starts it finishes in the same time?!).

It's nice to see that the +$8000 is really faster, because in my experiments (IIRC the tunnel code) it had no effect at all, at least nothing that could be seen in the fps count (or in the "raster bar timings", they were the same size).

Last question:

When I did my own timing test of the memory (simple color "rasters" on screen), the DRAM->DRAM blit code was as fast as anything that had to do with the GPU memory! I assume src & dest were within the same page in DRAM, hence the speed.

But it would be nice to see such a timing, i.e. a blitter phrase blit of DRAM->DRAM, preferably within the same page. Of course it would also be nice to see what happens if the blit crosses a page boundary so there is a page miss on each phrase (then perhaps we will see my point on why cleverly written GPU sdram code would be faster than the "running from main RAM" thingy).

Perhaps this is because DRAM->DRAM will be true 64 bits, while DRAM->GPU mem would have to be 2*32 bits.

I.e. keeping the page boundaries in mind when you design your code will give you the ultimate power of the Jaguar!

Sorry for all the questions, but timing issues are interesting from an optimisation point of view.

Nice work!

cheers

/Sym

SCPCD · August 20, 2007

So.. you are saying that it's slower to copy data FROM GPU mem to DRAM (5632) than it is to copy TO GPU mem (3584)?
Any ideas why? I don't really understand that from a logical point of view. Do the DRAM timings differ for reads vs writes? SRAM should have the same read & write time, right? How about DRAM?

Or is it an OP stall, a page miss on writes to DRAM, or what? Why is it slower to copy TO DRAM than it is to copy from it?

I think that it's a pipelining effect that takes a different amount of time for a read than for a write:

a read from a slow memory into a fast memory is easily pipelined. In the opposite direction it is not so easy to avoid losing cycles.

What is the GPU doing in your example code?

The GPU is not running in my example.

I'll try soon with some code running.

I'm just curious because I have always wondered what happens when you start a GPU->DRAM blit and continue running GPU code afterwards. I mean, logically something would have to halt to give the blitter access to GPU mem, right? I.e. a GPU->DRAM blit would halt GPU execution and parallelism would be lost, or?
Can this be verified with your LA? I.e. faster blits from GPU mem if the GPU is turned off (or perhaps it's just a stall issue, and once the blit starts it finishes in the same time?!).

When the GPU executes code from internal memory, it uses at most 50% of the internal memory bandwidth (1 instruction per cycle and a prefetch of 2 instructions, so 1 memory access every 2 cycles).

So blitting while the GPU runs at its highest speed reduces the blit speed.

I can verify it with the LA.

Last question:
When I did my own timing test of the memory (simple color "rasters" on screen), the DRAM->DRAM blit code was as fast as anything that had to do with the GPU memory! I assume src & dest were within the same page in DRAM, hence the speed.

But it would be nice to see such a timing...

I will do it someday.

Sorry for all the questions, but timing issues are interesting from an optimisation point of view.

Exactly !

Nice work!

Thanks

SebRmv · February 2, 2008

Ok, not really related but I have a question about the blitter:

Are the registers double buffered?

(This would mean that we can prepare the next blit while the current one is being completed.)

If yes, which ones exactly?

SCPCD · February 2, 2008

I'm not sure, but I think that none of the blitter's registers are double buffered.

[edit : "The data registers may only be written to while the Blitter is idle." page 70/141 of the Jag_v8 documentation]

The Jag2 blitter does have double buffering (for some registers).

SebRmv · February 3, 2008

[edit : "The data registers may only be written to while the Blitter is idle." page 70/141 of the Jag_v8 documentation]

Yes, I have seen that in the doc as well.

So this could mean that some of the registers are double buffered

(in particular, almost all of the address registers).

Has anybody tested this?

SebRmv · May 14, 2008

Assuming that the resolution is DEPTH16:

What about DRAM->DRAM in PIXEL mode?

In particular, I wonder whether

DRAM->DRAM in PIXEL mode

is faster or slower than

DRAM->GPU (fast access) in PIXEL mode

(assuming that it is correctly aligned to work. Edit: by the way, does fast access work in pixel mode with DEPTH16?)

GPU (fast access) -> DRAM in PHRASE mode

i.e. we blit in two passes, using the GPU RAM as an intermediate buffer.

Thanks

edit:

and thus, what about normal (slow) access

DRAM -> GPU in PIXEL mode : 7 * 4 = 28 cycles/phrase if I have understood correctly what precedes

GPU -> DRAM in PHRASE mode : 11 cycles/phrase according to the benchmark above

so, that would be 39 cycles/phrase

and, depending on the answer to the question,

DRAM -> GPU in PIXEL mode : 7 * 4 = 28 cycles/phrase if I have understood correctly what precedes

GPU (fast access) -> DRAM in PHRASE mode

SCPCD · May 14, 2008

I will run a benchmark tonight.

But don't forget that you cannot read at GPU_RAM+$8000; it is a write-only access. ("To allow faster transfers into the GPU space, all the registers are also available as thirty-two bit memory, at an offset of 8000 hex from their normal addresses. At this address, the internal memory is write only." p43/141)

SebRmv · May 14, 2008

I will run a benchmark tonight.

But don't forget that you cannot read at GPU_RAM+$8000; it is a write-only access. ("To allow faster transfers into the GPU space, all the registers are also available as thirty-two bit memory, at an offset of 8000 hex from their normal addresses. At this address, the internal memory is write only." p43/141)

Thanks, I forgot this point.

So basically, I would like to compare

DRAM->DRAM in PIXEL mode (DEPTH16)

to

DRAM -> GPU in PIXEL mode (still DEPTH16)

GPU -> DRAM in PHRASE mode

(which is 39 cycles/phrase if I have understood correctly)

SCPCD · May 14, 2008

    ; DRAM -> DRAM, 16 bpp, pixel mode
    move.l        #PITCH1|PIXEL16|WID128|XADDPIX,d0
    moveq         #0,d1
loooooop:
    move.l        d0,A2_FLAGS
    move.l        #$100000,A2_BASE        ; source: DRAM
    move.l        d1,A2_PIXEL
    move.l        d1,A2_STEP

    move.l        d0,A1_FLAGS
    move.l        #$80000,A1_BASE         ; destination: DRAM
    move.l        d1,A1_PIXEL
    move.l        d1,A1_FPIXEL
    move.l        d1,A1_STEP
    move.l        d1,A1_FSTEP
    move.l        d1,A1_CLIP
    move.l        d1,A1_INC
    move.l        d1,A1_FINC

    move.l        #$00010400,B_COUNT      ; 1 line of $400 pixels
    move.l        #SRCEN|LFU_REPLACE,B_CMD

    bra.s         loooooop                ; restart the same blit forever

Result: 11 cycles to copy 2 bytes, i.e. about 44 cycles per phrase.

--------------------------------------------------------------------------------------------

    ; DRAM -> GPU RAM (fast write window), 16 bpp, pixel mode
    move.l        #PITCH1|PIXEL16|WID128|XADDPIX,d0
    moveq         #0,d1
loooooop:
    move.l        d0,A2_FLAGS
    move.l        #$100000,A2_BASE        ; source: DRAM
    move.l        d1,A2_PIXEL
    move.l        d1,A2_STEP

    move.l        d0,A1_FLAGS
    move.l        #G_RAM+$8000,A1_BASE    ; destination: GPU RAM, write-only 32-bit window
    move.l        d1,A1_PIXEL
    move.l        d1,A1_FPIXEL
    move.l        d1,A1_STEP
    move.l        d1,A1_FSTEP
    move.l        d1,A1_CLIP
    move.l        d1,A1_INC
    move.l        d1,A1_FINC

    move.l        #$00010400,B_COUNT      ; 1 line of $400 pixels
    move.l        #SRCEN|LFU_REPLACE,B_CMD

    bra.s         loooooop                ; restart the same blit forever

Result: 5 cycles to copy 2 bytes, i.e. about 20 cycles per phrase.

    ; GPU RAM -> DRAM, 16 bpp, phrase mode
    move.l        #PITCH1|PIXEL16|WID128|XADDPHR,d0
    moveq         #0,d1
loooooop:
    move.l        d0,A2_FLAGS
    move.l        #G_RAM,A2_BASE          ; source: GPU RAM
    move.l        d1,A2_PIXEL
    move.l        d1,A2_STEP

    move.l        d0,A1_FLAGS
    move.l        #$100000,A1_BASE        ; destination: DRAM
    move.l        d1,A1_PIXEL
    move.l        d1,A1_FPIXEL
    move.l        d1,A1_STEP
    move.l        d1,A1_FSTEP
    move.l        d1,A1_CLIP
    move.l        d1,A1_INC
    move.l        d1,A1_FINC

    move.l        #$00010400,B_COUNT      ; 1 line of $400 pixels
    move.l        #SRCEN|LFU_REPLACE,B_CMD

    bra.s         loooooop                ; restart the same blit forever

Result: 11 cycles per phrase.

For DRAM->GPU followed by GPU->DRAM we get about 20 + 11 = 31 cycles per phrase, which is less than the 44 cycles per phrase of the direct DRAM->DRAM version.
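The comparison can be restated as a small Python sketch using only the three measurements from this post. It ignores the cost of setting up two blits instead of one, which is an assumption of this illustration and was not measured here.

```python
PIXELS_PER_PHRASE = 64 // 16   # DEPTH16: 4 pixels per 64-bit phrase

# Measured costs from the three benchmarks in this post
dram_to_dram_pixel = 11        # cycles per 16-bit pixel, pixel mode
dram_to_gpu_fast_pixel = 5     # cycles per 16-bit pixel, pixel mode, G_RAM+$8000
gpu_to_dram_phrase = 11        # cycles per phrase, phrase mode

direct = dram_to_dram_pixel * PIXELS_PER_PHRASE
via_gpu = dram_to_gpu_fast_pixel * PIXELS_PER_PHRASE + gpu_to_dram_phrase
saving = 1 - via_gpu / direct  # fraction of cycles saved by the two-pass route

print(direct, via_gpu, round(saving * 100, 1))
```

Under these numbers the two-pass route through GPU RAM saves roughly 30% of the cycles per phrase compared with the direct pixel-mode copy.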
