
Blitter Timing


SCPCD


The goal of this topic is to give information about the different access timings of the blitter, and to learn more about how the blitter operates.

 

First of all, the example code:

    move.l        #PITCH1|PIXEL32|WID128|XADDPHR,d0
    moveq        #0,d1

    move.l        d0,A2_FLAGS
    move.l        #source,A2_BASE
    move.l        d1,A2_PIXEL
    move.l        d1,A2_STEP
    
    move.l        d0,A1_FLAGS
    move.l        #destination,A1_BASE
    move.l        d1,A1_PIXEL
    move.l        d1,A1_FPIXEL
    move.l        d1,A1_STEP
    move.l        d1,A1_FSTEP
    move.l        d1,A1_CLIP
    move.l        d1,A1_INC
    move.l        d1,A1_FINC
    
    move.l        #$00010400,B_COUNT
    move.l        #SRCEN|LFU_REPLACE,B_CMD

This configures the blitter for 32 bits per pixel and transfers in phrase mode.

The source and destination are different for each of the cases described below.

 

We blit: $0001 * $0400 * 4 (because 32-bit pixels are selected) = 4096 bytes.

SRCEN enables the source read, and LFU_REPLACE selects a simple data copy.

All other blitter registers are unused and are initialised to zero here.
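
The byte count above can be checked with a short sketch. It assumes, following the "$0001 * $0400" reading in the post, that the high word of B_COUNT is the line count and the low word is the pixel count per line; the helper name `blit_bytes` is hypothetical.

```python
def blit_bytes(b_count: int, bits_per_pixel: int) -> int:
    """Bytes moved by one blit, given the B_COUNT value and the pixel depth.

    Assumption: high 16 bits = number of lines, low 16 bits = pixels per line.
    """
    lines = (b_count >> 16) & 0xFFFF
    pixels_per_line = b_count & 0xFFFF
    return lines * pixels_per_line * bits_per_pixel // 8


# The example from the post: B_COUNT = $00010400 with PIXEL32.
print(blit_bytes(0x00010400, 32))  # 4096
```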

 

The GPU->DRAM transfer in phrase mode:

source : $F03000 (G_RAM)

destination : somewhere in DRAM (phrase aligned)

 

post-5-1186932355_thumb.jpg

Result : 11 cycles per phrase -> 4096*11/8 = 5632 cycles for 4K

 

The DRAM->GPU transfer in phrase mode:

source : somewhere in DRAM (phrase aligned)

destination : $F03000 (G_RAM)

 

post-5-1186932269_thumb.jpg

Result : 7 cycles per phrase -> 4096*7/8 = 3584 cycles for 4K

 

The fast DRAM->GPU transfer in phrase mode:

source : somewhere in DRAM (phrase aligned)

destination : $F03000+$8000 (G_RAM+$8000)

 

post-5-1186932120_thumb.jpg

Result : 5 cycles per phrase -> 4096*5/8 = 2560 cycles for 4K
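
The three totals follow from the same formula (bytes / 8 bytes per phrase, times cycles per phrase). A minimal sketch that reproduces them from the measured per-phrase figures; the function and dictionary names are made up for illustration.

```python
PHRASE_BYTES = 8  # one phrase is 64 bits


def blit_cycles(total_bytes: int, cycles_per_phrase: int) -> int:
    """Total cycles for a phrase-mode blit of total_bytes."""
    return total_bytes // PHRASE_BYTES * cycles_per_phrase


# Cycles per phrase as measured in the post.
measured = {
    "GPU->DRAM": 11,
    "DRAM->GPU": 7,
    "DRAM->GPU (+$8000)": 5,
}

for name, cycles_per_phrase in measured.items():
    print(f"{name}: {blit_cycles(4096, cycles_per_phrase)} cycles for 4K")
```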

 

 

-----------------------------------------------

More information in the future :)

If you have a particular timing you would like measured, I can help ;)


interesting! and what about the pixel mode?

in phrase mode, is it the same result if using 8 bpp, 16 bpp or 32 bpp?


interesting! and what about the pixel mode?

In pixel mode we have the same cycle time per access, but instead of a phrase, a single pixel is transferred each time. For one phrase of data:

- in PIXEL1 there are 64 DRAM reads

- in PIXEL2 there are 32 DRAM reads

- in PIXEL4 there are 16 DRAM reads

- in PIXEL8 there are 8 DRAM reads

- in PIXEL16 there are 4 DRAM reads

- in PIXEL32 there are 2 DRAM reads

 

The number of cycles for each read/write is exactly the same as before, depending on the source & destination.

 

So for PIXEL16 in pixel mode, this is 4 times slower than in phrase mode.
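
The list above is just 64 bits divided by the pixel depth. A small sketch, assuming the per-access cycle count stays the same as in phrase mode (as stated in the post); the function names are hypothetical.

```python
PHRASE_BITS = 64


def accesses_per_phrase(bits_per_pixel: int) -> int:
    """Number of pixel-mode accesses needed to move one phrase of data."""
    return PHRASE_BITS // bits_per_pixel


def pixel_mode_cycles_per_phrase(bits_per_pixel: int, cycles_per_access: int) -> int:
    """Cycles to move one phrase in pixel mode, one pixel per access."""
    return accesses_per_phrase(bits_per_pixel) * cycles_per_access


for depth in (1, 2, 4, 8, 16, 32):
    print(f"PIXEL{depth}: {accesses_per_phrase(depth)} accesses per phrase")

# PIXEL16 with the 7-cycle DRAM->GPU figure: 4 accesses * 7 cycles.
print(pixel_mode_cycles_per_phrase(16, 7))  # 28
```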

 

in phrase mode, is it the same result if using 8 bpp, 16 bpp or 32 bpp?

Yes, it's exactly the same result.

 

(I haven't verified that the data is correct for modes below 32 bpp when copying from & to the GPU.)


Nice!... =)

 

So... you are saying that it's slower to copy data FROM GPU memory to DRAM (5632) than it is to copy TO GPU memory (3584)?

Any ideas why? I don't really understand that from a logical point of view. Do the DRAM timings differ for reads vs writes? SRAM should have the same read & write time, right? How about DRAM?

Or is it an OP stall, a page miss on writes to DRAM, or what? Why is it slower to copy TO DRAM than it is to copy from it?

 

What is the GPU doing in your example code? I'm just curious because I have always wondered what happens when you start a GPU->DRAM blit and continue running GPU code afterwards. I mean, logically something would have to halt to give the blitter access to GPU memory, right? I.e. a GPU->DRAM blit would halt GPU execution and parallelism would be lost, or not?

Can this be verified with your LA? I.e. faster blits from GPU memory if the GPU is turned off. (Or perhaps it's just a stall issue, and once the blit starts it finishes in the same time?!)

 

 

It's nice to see that the +$8000 is really faster, because in my experiments (IIRC the tunnel code) it had no effect at all, or at least nothing that could be seen in the fps count (or in the "raster bar timings", which were the same size).

 

 

Last question:

When I did my own timing test of the memory (simple color "rasters" on screen), the DRAM->DRAM blit code was as fast as anything that had to do with the GPU memory! I assume src & dest were within the same page in DRAM, hence the speed.

But it would be nice to see such a timing, i.e. a blitter phrase blit of DRAM->DRAM, preferably within the same page. Of course it would also be nice to see what happens if the blit crosses a page boundary so that there is a page miss on each phrase. (Then perhaps we will see my point on why cleverly written GPU sdram code would be faster than the "running from main RAM" thingy.)

Perhaps this is because DRAM->DRAM will be true 64 bits, while DRAM->GPU memory would have to be 2*32 bits.

I.e. keeping the page boundaries in mind when you design your code will give you the ultimate power of the Jaguar!

 

Sorry for all the questions, but timing issues are interesting from an optimisation point of view :P

 

Nice work!

cheers

/Sym


So... you are saying that it's slower to copy data FROM GPU memory to DRAM (5632) than it is to copy TO GPU memory (3584)?

Any ideas why? I don't really understand that from a logical point of view. Do the DRAM timings differ for reads vs writes? SRAM should have the same read & write time, right? How about DRAM?

Or is it an OP stall, a page miss on writes to DRAM, or what? Why is it slower to copy TO DRAM than it is to copy from it?

I think it's a pipelining effect, which costs a different amount of time for a read than for a write:

a read from a slow memory into a fast memory is easily pipelined. In the opposite direction it is not so easy to avoid losing cycles.

What is the GPU doing in your example code?

The GPU isn't running in my example :)

I'll try it soon with some code running.

I'm just curious because I have always wondered what happens when you start a GPU->DRAM blit and continue running GPU code afterwards. I mean, logically something would have to halt to give the blitter access to GPU memory, right? I.e. a GPU->DRAM blit would halt GPU execution and parallelism would be lost, or not?

Can this be verified with your LA? I.e. faster blits from GPU memory if the GPU is turned off. (Or perhaps it's just a stall issue, and once the blit starts it finishes in the same time?!)

When the GPU executes code from internal memory, it uses at most 50% of the internal memory bandwidth (1 instruction per cycle and a prefetch of 2 instructions, so 1 memory access every 2 cycles).

So blitting while the GPU runs at its highest speed reduces the blit speed.
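
The 50% figure follows directly from the two numbers in parentheses. A minimal sketch, assuming (as the post states) one instruction executed per cycle and two instructions delivered per fetch; the function name is hypothetical, and the leftover share is an idealised upper bound, not a measurement.

```python
def gpu_fetch_share(instructions_per_cycle: float = 1.0,
                    instructions_per_fetch: int = 2) -> float:
    """Fraction of internal-RAM cycles spent on GPU instruction fetches."""
    return instructions_per_cycle / instructions_per_fetch


share = gpu_fetch_share()
print(f"GPU instruction fetch: {share:.0%} of internal RAM cycles")
print(f"Left for other accesses (upper bound): {1 - share:.0%}")
```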

 

I can verify it with the LA ;)

Last question:

When I did my own timing test of the memory (simple color "rasters" on screen), the DRAM->DRAM blit code was as fast as anything that had to do with the GPU memory! I assume src & dest were within the same page in DRAM, hence the speed.

But it would be nice to see such a timing...

I will do it someday ;)

 

Sorry for all the questions, but timing issues are interesting from an optimisation point of view :P

Exactly !

 

Nice work!

Thanks ;)


  • 5 months later...

Ok, not really related but I have a question about the blitter:

 

Are the registers double buffered?

(This would mean that we can prepare the next blit while the current one is being completed.)

If yes, which ones exactly?

 

 


I'm not sure, but I think that none of the blitter's registers are double buffered.

 

[edit: "The data registers may only be written to while the Blitter is idle." page 70/141 of the Jag_v8 documentation]

 

 

The Jag2 blitter does have double buffering (for some registers).


[edit: "The data registers may only be written to while the Blitter is idle." page 70/141 of the Jag_v8 documentation]

 

Yes, I have seen that also in the doc.

So this could mean that some of the registers are double buffered

(in particular, almost all of the address registers)

 

Has anybody tested this?

 

 


  • 3 months later...

Assuming that resolution is DEPTH16

 

What about DRAM->DRAM in PIXEL mode?

 

In particular, I wonder whether

 

DRAM->DRAM in PIXEL mode

 

is faster or slower than

 

DRAM->GPU (fast access) in PIXEL mode

(assuming that it is correctly aligned to work; edit: by the way, does fast access work in pixel mode with DEPTH16?)

followed by GPU (fast access) -> DRAM in PHRASE mode

 

i.e. we blit in two passes, using the GPU RAM as an intermediate buffer

 

Thanks

 

edit:

 

and thus, what about normal (slow) access?

 

DRAM -> GPU in PIXEL mode : 7 * 4 = 28 cycles/phrase if I have understood correctly what precedes

GPU -> DRAM in PHRASE mode : 11 cycles/phrase according to the benchmark above

so, that would be 39 cycles/phrase

 

and, depending on the answer to the question,

 

DRAM -> GPU in PIXEL mode : 7 * 4 = 28 cycles/phrase if I have understood correctly what precedes

GPU (fast access) -> DRAM in PHRASE mode
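
The 39 cycles/phrase estimate can be written out as a small sketch. It assumes, as the question does, that the 7-cycle and 11-cycle figures from the first post carry over unchanged; the function name is hypothetical.

```python
PHRASE_BITS = 64


def two_pass_cycles_per_phrase(pixel_pass_cycles_per_access: int,
                               bits_per_pixel: int,
                               phrase_pass_cycles_per_phrase: int) -> int:
    """Cost per phrase of a pixel-mode pass followed by a phrase-mode pass."""
    accesses = PHRASE_BITS // bits_per_pixel  # pixel-mode accesses per phrase
    return accesses * pixel_pass_cycles_per_access + phrase_pass_cycles_per_phrase


# DRAM -> GPU in pixel mode at DEPTH16 (7 cycles per access),
# then GPU -> DRAM in phrase mode (11 cycles per phrase).
print(two_pass_cycles_per_phrase(7, 16, 11))  # 39
```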


I will run the benchmark tonight.

 

But don't forget that you cannot read at GPU_RAM+$8000, it is a write-only access. ;) ("To allow faster transfers into the GPU space, all the registers are also available as thirty-two bit memory, at an offset of 8000 hex from their normal addresses. At this address, the internal memory is write only." p43/141)

 


I will run the benchmark tonight.

 

But don't forget that you cannot read at GPU_RAM+$8000, it is a write-only access. ;) ("To allow faster transfers into the GPU space, all the registers are also available as thirty-two bit memory, at an offset of 8000 hex from their normal addresses. At this address, the internal memory is write only." p43/141)

 

Thanks, I forgot this point.

 

So basically, I would like to compare

 

DRAM->DRAM in PIXEL mode (DEPTH16)

 

to

 

DRAM -> GPU in PIXEL mode (still DEPTH16)

GPU -> DRAM in PHRASE mode

(which is 39 cycles/phrase if I have understood correctly)


        ; DRAM -> DRAM, 16-bit pixels, pixel mode
        move.l        #PITCH1|PIXEL16|WID128|XADDPIX,d0
        moveq         #0,d1
    loooooop:
        move.l        d0,A2_FLAGS
        move.l        #$100000,A2_BASE        ; source in DRAM
        move.l        d1,A2_PIXEL
        move.l        d1,A2_STEP

        move.l        d0,A1_FLAGS
        move.l        #$80000,A1_BASE         ; destination in DRAM
        move.l        d1,A1_PIXEL
        move.l        d1,A1_FPIXEL
        move.l        d1,A1_STEP
        move.l        d1,A1_FSTEP
        move.l        d1,A1_CLIP
        move.l        d1,A1_INC
        move.l        d1,A1_FINC

        move.l        #$00010400,B_COUNT      ; 1 line of $400 pixels
        move.l        #SRCEN|LFU_REPLACE,B_CMD

        bra.s         loooooop                ; repeat the blit forever

post-5-1210789893_thumb.jpg

 

We get 11 cycles to copy 2 bytes, i.e. about 44 cycles/phrase.

 

--------------------------------------------------------------------------------------------

        ; DRAM -> GPU RAM (fast write window), 16-bit pixels, pixel mode
        move.l        #PITCH1|PIXEL16|WID128|XADDPIX,d0
        moveq         #0,d1
    loooooop:
        move.l        d0,A2_FLAGS
        move.l        #$100000,A2_BASE        ; source in DRAM
        move.l        d1,A2_PIXEL
        move.l        d1,A2_STEP

        move.l        d0,A1_FLAGS
        move.l        #G_RAM+$8000,A1_BASE    ; destination: write-only GPU RAM window
        move.l        d1,A1_PIXEL
        move.l        d1,A1_FPIXEL
        move.l        d1,A1_STEP
        move.l        d1,A1_FSTEP
        move.l        d1,A1_CLIP
        move.l        d1,A1_INC
        move.l        d1,A1_FINC

        move.l        #$00010400,B_COUNT      ; 1 line of $400 pixels
        move.l        #SRCEN|LFU_REPLACE,B_CMD

        bra.s         loooooop                ; repeat the blit forever

 

post-5-1210790400_thumb.jpg

 

We get 5 cycles for 2 bytes, i.e. about 20 cycles/phrase.

 

        ; GPU RAM -> DRAM, 16-bit pixels, phrase mode
        move.l        #PITCH1|PIXEL16|WID128|XADDPHR,d0
        moveq         #0,d1
    loooooop:
        move.l        d0,A2_FLAGS
        move.l        #G_RAM,A2_BASE          ; source in GPU RAM
        move.l        d1,A2_PIXEL
        move.l        d1,A2_STEP

        move.l        d0,A1_FLAGS
        move.l        #$100000,A1_BASE        ; destination in DRAM
        move.l        d1,A1_PIXEL
        move.l        d1,A1_FPIXEL
        move.l        d1,A1_STEP
        move.l        d1,A1_FSTEP
        move.l        d1,A1_CLIP
        move.l        d1,A1_INC
        move.l        d1,A1_FINC

        move.l        #$00010400,B_COUNT      ; 1 line of $400 pixels
        move.l        #SRCEN|LFU_REPLACE,B_CMD

        bra.s         loooooop                ; repeat the blit forever

 

post-5-1210790876_thumb.jpg

 

We get 11 cycles/phrase.

 

For DRAM->GPU followed by GPU->DRAM we get about 20+11 = 31 cycles/phrase, which is less than the 44 cycles/phrase of the DRAM->DRAM version.
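
To put the comparison in terms of a whole blit, here is a minimal sketch using the measured per-phrase figures and the $0400 pixels of 16 bits set up in the test code. It assumes the two passes simply add up with no setup cost between them; the names are hypothetical.

```python
PHRASE_BYTES = 8


def total_cycles(total_bytes: int, cycles_per_phrase: int) -> int:
    """Total cycles for a blit, from its size and a per-phrase cost."""
    return total_bytes // PHRASE_BYTES * cycles_per_phrase


blit_bytes = 0x0400 * 2  # $0400 pixels of 16 bits = 2048 bytes

direct = total_cycles(blit_bytes, 44)         # DRAM -> DRAM in pixel mode
two_pass = total_cycles(blit_bytes, 20 + 11)  # DRAM -> GPU (+$8000), then GPU -> DRAM

print(direct, two_pass, direct - two_pass)    # 11264 7936 3328
```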

 

