Blitter Timing

SCPCD    0

The goal of this topic is to give information about the blitter's access timings in different cases, and to learn more about how the blitter operates.

 

First of all, the example code:

    move.l        #PITCH1|PIXEL32|WID128|XADDPHR,d0
    moveq        #0,d1

    move.l        d0,A2_FLAGS
    move.l        #source,A2_BASE
    move.l        d1,A2_PIXEL
    move.l        d1,A2_STEP
    
    move.l        d0,A1_FLAGS
    move.l        #destination,A1_BASE
    move.l        d1,A1_PIXEL
    move.l        d1,A1_FPIXEL
    move.l        d1,A1_STEP
    move.l        d1,A1_FSTEP
    move.l        d1,A1_CLIP
    move.l        d1,A1_INC
    move.l        d1,A1_FINC
    
    move.l        #$00010400,B_COUNT
    move.l        #SRCEN|LFU_REPLACE,B_CMD

This configures the blitter for 32 bits per pixel and transfers in phrase mode.

The source and destination differ for each of the cases described below.

 

We blit: $0001 * $0400 * 4 (because 32-bit pixels are selected) = 4096 bytes.

SRCEN enables reading from a source, and LFU_REPLACE selects a simple data copy.

All other blitter registers are unused and are initialised to zero here.
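As a quick check of the arithmetic above, here is a minimal Python sketch that decodes the B_COUNT value used in the example. It assumes the layout implied by the post (outer count in the high word, inner count in the low word) and an 8-byte phrase; the function name `blit_bytes` is made up for illustration.

```python
def blit_bytes(b_count: int, bits_per_pixel: int) -> int:
    """Return the number of bytes moved by a blit with the given B_COUNT.

    Assumes the outer loop count is in the high 16 bits and the inner
    loop count (pixels per line) in the low 16 bits, as in the post.
    """
    outer = (b_count >> 16) & 0xFFFF
    inner = b_count & 0xFFFF
    return outer * inner * bits_per_pixel // 8


total = blit_bytes(0x00010400, 32)
print(total)       # 4096 bytes
print(total // 8)  # 512 phrases of 8 bytes each
```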

 

GPU->DRAM transfer in phrase mode:

source: $F03000 (G_RAM)

destination: somewhere in DRAM (phrase aligned)

 

post-5-1186932355_thumb.jpg

Result : 11 cycles per phrase -> 4096*11/8 = 5632 cycles for 4K

 

DRAM->GPU transfer in phrase mode:

source: somewhere in DRAM (phrase aligned)

destination: $F03000 (G_RAM)

 

post-5-1186932269_thumb.jpg

Result : 7 cycles per phrase -> 4096*7/8 = 3584 cycles for 4K

 

Fast DRAM->GPU transfer in phrase mode:

source: somewhere in DRAM (phrase aligned)

destination: $F03000+$8000 (G_RAM+$8000)

 

post-5-1186932120_thumb.jpg

Result : 5 cycles per phrase -> 4096*5/8 = 2560 cycles for 4K
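The three results can be reproduced with a small Python sketch. The cycles-per-phrase figures are the ones measured above; the dictionary keys and the helper name `total_cycles` are only labels chosen for this illustration.

```python
BYTES_PER_PHRASE = 8

# Measured cycles per phrase, taken from the results in this post.
CYCLES_PER_PHRASE = {
    "GPU->DRAM": 11,
    "DRAM->GPU": 7,
    "DRAM->GPU (+$8000)": 5,
}


def total_cycles(n_bytes: int, cycles_per_phrase: int) -> int:
    """Total blitter cycles for a phrase-mode copy of n_bytes."""
    return n_bytes // BYTES_PER_PHRASE * cycles_per_phrase


for name, cycles in CYCLES_PER_PHRASE.items():
    print(f"{name}: {total_cycles(4096, cycles)} cycles for 4K")
```

For 4096 bytes this gives 5632, 3584 and 2560 cycles, matching the figures quoted for each case.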

 

 

-----------------------------------------------

More information to come in the future :)

If you have a particular timing you would like measured, I can help ;)

SebRmv    2

interesting! and what about the pixel mode?

in phrase mode, is it the same result if using 8 bpp, 16 bpp or 32 bpp?

SCPCD    0
interesting! and what about the pixel mode?

In pixel mode the cycle time per access is the same, but each access transfers one pixel instead of one phrase. So for one phrase of data:

- in PIXEL1 there are 64 DRAM reads

- in PIXEL2 there are 32 DRAM reads

- in PIXEL4 there are 16 DRAM reads

- in PIXEL8 there are 8 DRAM reads

- in PIXEL16 there are 4 DRAM reads

- in PIXEL32 there are 2 DRAM reads

The number of cycles for each read/write is exactly the same as before, depending on the source and destination.

 

So for PIXEL16 in pixel mode, this is 4 times slower than in phrase mode.
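The list above follows from dividing a 64-bit phrase by the pixel size. A minimal Python sketch of that rule, assuming (as stated above) that each pixel-mode access costs the same number of cycles as a phrase-mode access; the function names are made up for illustration:

```python
PHRASE_BITS = 64


def accesses_per_phrase(bits_per_pixel: int) -> int:
    """Pixel-mode accesses needed to move one phrase of data."""
    return PHRASE_BITS // bits_per_pixel


def pixel_mode_cycles_per_phrase(bits_per_pixel: int, cycles_per_access: int) -> int:
    """Cycles per phrase in pixel mode, given the measured cycles per access."""
    return accesses_per_phrase(bits_per_pixel) * cycles_per_access


for bpp in (1, 2, 4, 8, 16, 32):
    print(f"PIXEL{bpp}: {accesses_per_phrase(bpp)} accesses per phrase")

# DRAM->GPU at 7 cycles per access with 16-bit pixels:
print(pixel_mode_cycles_per_phrase(16, 7))  # 28 cycles per phrase
```

The slowdown relative to phrase mode is simply `accesses_per_phrase(bpp)`, which is 4 for PIXEL16.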

 

in phrase mode, is it the same result if using 8 bpp, 16 bpp or 32 bpp?

Yes, it's exactly the same result.

(I haven't verified that the data is correct in modes below 32 bpp when blitting from and to the GPU.)


Nice!... =)

 

So.. you are saying that it's slower to copy data FROM GPU mem to DRAM (5632) than it is to copy TO GPU mem (3584)?

Any ideas why? I don't really understand that from a logical point of view.. does the DRAM timing differ for reads vs writes? SRAM should have the same read and write time, right? How about DRAM?

Or is it an OP stall, a page miss on writes to DRAM, or what? Why is it slower to copy TO DRAM than it is to copy from it?

 

What is the GPU doing in your example code? I'm just curious because I have always wondered what happens when you start a GPU->DRAM blit and continue running GPU code afterwards... I mean, logically something would have to halt to give the blitter access to GPU mem, right? i.e. a GPU->DRAM blit would halt GPU execution and parallelism would be lost, or?

Can this be verified with your LA? i.e. faster blits from GPU mem if the GPU is turned off. (Or perhaps it's just a stall issue, and once the blit starts it finishes in the same time?!)

 

 

It's nice to see that the +$8000 is really faster, because in my experiments (IIRC the tunnel code) it had no effect at all.. not something that could be seen in the fps count anyway (or in the "raster bar timings", they were the same size).

 

 

Last question:

When I did my own timing test of the memory (simple color "rasters" on screen) the DRAM->DRAM blit code was as fast as anything that had to do with the GPU memory!... I assume src & dest were within the same page in DRAM, hence the speed..

But it would be nice to see such a timing... i.e. a blitter phrase blit of DRAM->DRAM, preferably within the same page. Well, of course it would also be nice to see what happens if the blit crosses a page boundary so there is a page miss on each phrase.. (then perhaps we will see my point on why cleverly written GPU sdram code would be faster than the "running from main ram" thingy..)

Perhaps this is because DRAM->DRAM will be true 64 bits, while DRAM->GPU mem would have to be 2*32 bits..

i.e. keeping the page boundaries in mind when you design your code will give you the ultimate power of the Jaguar!

 

Sorry for all the questions, but timing issues are interesting from an optimisation point of view :P

 

Nice work!

cheers

/Sym

SCPCD    0
So.. you are saying that it's slower to copy data FROM GPU mem to DRAM (5632) than it is to copy TO GPU mem (3584)?

Any ideas why? I don't really understand that from a logical point of view.. does the DRAM timing differ for reads vs writes? SRAM should have the same read and write time, right? How about DRAM?

Or is it an OP stall, a page miss on writes to DRAM, or what? Why is it slower to copy TO DRAM than it is to copy from it?

I think it's a pipelining effect that makes reads and writes take different amounts of time:

a transfer from a slow memory to a fast memory is easily pipelined. In the opposite direction it's not so easy to avoid losing cycles.

What is the GPU doing in your example code?

The GPU isn't running in my example :)

I'll try soon with some code running on it.

I'm just curious because I have always wondered what happens when you start a GPU->DRAM blit and continue running GPU code afterwards... I mean, logically something would have to halt to give the blitter access to GPU mem, right? i.e. a GPU->DRAM blit would halt GPU execution and parallelism would be lost, or?

Can this be verified with your LA? i.e. faster blits from GPU mem if the GPU is turned off. (Or perhaps it's just a stall issue, and once the blit starts it finishes in the same time?!)

When the GPU executes code from internal memory, it uses at most 50% of the internal memory bandwidth (1 instruction per cycle and a prefetch of 2 instructions -> 1 memory access every 2 cycles).

So blitting while the GPU is running at its highest speed reduces the blit speed.

I can verify it with the LA ;)

Last question:

When I did my own timing test of the memory (simple color "rasters" on screen) the DRAM->DRAM blit code was as fast as anything that had to do with the GPU memory!... I assume src & dest were within the same page in DRAM, hence the speed..

But it would be nice to see such a timing...

I will do it someday ;)

Sorry for all the questions, but timing issues are interesting from an optimisation point of view :P

Exactly!

 

Nice work!

Thanks ;)

SebRmv    2

Ok, not really related but I have a question about the blitter:

 

Are the registers double buffered?

(This would mean that we can prepare the next blit while the current one is being completed.)

If yes, which ones exactly?

 

 

SCPCD    0

I'm not sure, but I think that none of the blitter's registers are double buffered.

[edit: "The data registers may only be written to while the Blitter is idle." page 70/141 of the Jag_v8 documentation]

The Jag2 blitter does have double buffering (for some registers).

SebRmv    2
[edit: "The data registers may only be written to while the Blitter is idle." page 70/141 of the Jag_v8 documentation]

Yes, I have seen that in the doc as well.

So this could mean that some of the registers are double buffered

(in particular, almost all of the address registers).

Has anybody tested this?

 

 

SebRmv    2

Assuming that resolution is DEPTH16

 

What about DRAM->DRAM in PIXEL mode

 

In particular, I wonder whether

 

DRAM->DRAM in PIXEL mode

 

is faster or slower than

 

DRAM->GPU (fast access) in PIXEL mode

(assuming that it is correctly aligned to work; edit: by the way, does fast access work in pixel mode with DEPTH16?)

GPU (fast access) -> DRAM in PHRASE mode

i.e. we blit in two passes, using the GPU RAM as an intermediate buffer

 

Thanks

 

edit:

 

and thus, what about normal (slow) access

 

DRAM -> GPU in PIXEL mode : 7 * 4 = 28 cycles/phrase if I have understood correctly what precedes

GPU -> DRAM in PHRASE mode : 11 cycles/phrase according to the benchmark above

so, that would be 39 cycles/phrase

 

and, depending on the answer to the question,

 

DRAM -> GPU in PIXEL mode : 7 * 4 = 28 cycles/phrase if I have understood correctly what precedes

GPU (fast access) -> DRAM in PHRASE mode

SCPCD    0

I will run a benchmark tonight.

But don't forget that you cannot read at GPU_RAM+$8000; it's a write-only access ;) ("To allow faster transfers into the GPU space, all the registers are also available as thirty-two bit memory, at an offset of 8000 hex from their normal addresses. At this address, the internal memory is write only." p43/141)

 

SebRmv    2
I will run a benchmark tonight.

But don't forget that you cannot read at GPU_RAM+$8000; it's a write-only access ;) ("To allow faster transfers into the GPU space, all the registers are also available as thirty-two bit memory, at an offset of 8000 hex from their normal addresses. At this address, the internal memory is write only." p43/141)

 

Thanks, I forgot this point.

 

So basically, I would like to compare

 

DRAM->DRAM in PIXEL mode (DEPTH16)

 

to

 

DRAM -> GPU in PIXEL mode (still DEPTH16)

GPU -> DRAM in PHRASE mode

(which is 39 cycles/phrase if I have understood correctly)

SCPCD    0

move.l		#PITCH1|PIXEL16|WID128|XADDPIX,d0
moveq		#0,d1
loooooop:
move.l		d0,A2_FLAGS
move.l		#$100000,A2_BASE
move.l		d1,A2_PIXEL
move.l		d1,A2_STEP

move.l		d0,A1_FLAGS
move.l		#$80000,A1_BASE
move.l		d1,A1_PIXEL
move.l		d1,A1_FPIXEL
move.l		d1,A1_STEP
move.l		d1,A1_FSTEP
move.l		d1,A1_CLIP
move.l		d1,A1_INC
move.l		d1,A1_FINC

move.l		#$00010400,B_COUNT
move.l		#SRCEN|LFU_REPLACE,B_CMD

bra.s		loooooop

post-5-1210789893_thumb.jpg

 

DRAM->DRAM in pixel mode (16 bpp): we have 11 cycles to copy 2 bytes, so about 44 cycles/phrase.

 

--------------------------------------------------------------------------------------------

move.l		#PITCH1|PIXEL16|WID128|XADDPIX,d0
moveq		#0,d1
loooooop:
move.l		d0,A2_FLAGS
move.l		#$100000,A2_BASE
move.l		d1,A2_PIXEL
move.l		d1,A2_STEP

move.l		d0,A1_FLAGS
move.l		#G_RAM+$8000,A1_BASE
move.l		d1,A1_PIXEL
move.l		d1,A1_FPIXEL
move.l		d1,A1_STEP
move.l		d1,A1_FSTEP
move.l		d1,A1_CLIP
move.l		d1,A1_INC
move.l		d1,A1_FINC

move.l		#$00010400,B_COUNT
move.l		#SRCEN|LFU_REPLACE,B_CMD

bra.s		loooooop

 

post-5-1210790400_thumb.jpg

 

DRAM->GPU (+$8000) in pixel mode (16 bpp): we have 5 cycles for 2 bytes, so about 20 cycles/phrase.

 

move.l		#PITCH1|PIXEL16|WID128|XADDPHR,d0
moveq		#0,d1
loooooop:
move.l		d0,A2_FLAGS
move.l		#G_RAM,A2_BASE
move.l		d1,A2_PIXEL
move.l		d1,A2_STEP

move.l		d0,A1_FLAGS
move.l		#$100000,A1_BASE
move.l		d1,A1_PIXEL
move.l		d1,A1_FPIXEL
move.l		d1,A1_STEP
move.l		d1,A1_FSTEP
move.l		d1,A1_CLIP
move.l		d1,A1_INC
move.l		d1,A1_FINC

move.l		#$00010400,B_COUNT
move.l		#SRCEN|LFU_REPLACE,B_CMD

bra.s		loooooop

 

post-5-1210790876_thumb.jpg

 

GPU->DRAM in phrase mode: we have 11 cycles/phrase.

For DRAM->GPU followed by GPU->DRAM we get about 20 + 11 = 31 cycles/phrase, which is less than the 44 cycles/phrase of the direct DRAM->DRAM version.
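The comparison can be written out as a short Python sketch using only the cycle counts measured in this thread. It ignores the cost of setting up the second blit and assumes the data fits in GPU RAM, so treat the numbers as a rough estimate rather than a full benchmark; the variable names are chosen for illustration.

```python
ACCESSES_PER_PHRASE_16BPP = 64 // 16  # 4 pixel-mode accesses per phrase

# Direct copy: DRAM->DRAM in pixel mode, 11 cycles per 16-bit pixel.
direct = 11 * ACCESSES_PER_PHRASE_16BPP

# Two passes through GPU RAM:
#   pass 1: DRAM->GPU (+$8000) in pixel mode, 5 cycles per pixel
#   pass 2: GPU->DRAM in phrase mode, 11 cycles per phrase
two_pass_fast = 5 * ACCESSES_PER_PHRASE_16BPP + 11

# Same idea with the normal (slow) GPU address, 7 cycles per pixel,
# which is the 39 cycles/phrase estimated earlier in the thread.
two_pass_slow = 7 * ACCESSES_PER_PHRASE_16BPP + 11

print(direct, two_pass_fast, two_pass_slow)  # 44 31 39
print(direct - two_pass_fast)                # 13 cycles saved per phrase
```

Both two-pass variants come out below the direct pixel-mode copy under these assumptions.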

 
