SCPCD Posted August 12, 2007

The goal of this topic is to share timing measurements for different blitter accesses and to learn more about how the blitter operates.

First of all, the example code:

    move.l #PITCH1|PIXEL32|WID128|XADDPHR,d0   ; flags shared by A1 and A2
    moveq #0,d1                                ; d1 = 0 for the unused registers
    move.l d0,A2_FLAGS
    move.l #source,A2_BASE
    move.l d1,A2_PIXEL
    move.l d1,A2_STEP
    move.l d0,A1_FLAGS
    move.l #destination,A1_BASE
    move.l d1,A1_PIXEL
    move.l d1,A1_FPIXEL
    move.l d1,A1_STEP
    move.l d1,A1_FSTEP
    move.l d1,A1_CLIP
    move.l d1,A1_INC
    move.l d1,A1_FINC
    move.l #$00010400,B_COUNT                  ; $0001 line of $0400 pixels
    move.l #SRCEN|LFU_REPLACE,B_CMD            ; start the blit: plain copy

This configures the blitter for 32 bits per pixel and transfers in phrase mode. The source and destination differ for each case described below.

We blit $0001 * $0400 * 4 (because 32-bit pixels are selected) = 4096 bytes.
SRCEN enables a source, and LFU_REPLACE selects a simple data copy.
All other blitter registers are unused and are initialised to zero here.

GPU->DRAM transfer in phrase mode:
source: $F03000 (G_RAM)
destination: somewhere in DRAM (phrase aligned)
Result: 11 cycles per phrase -> 4096*11/8 = 5632 cycles for 4K

DRAM->GPU transfer in phrase mode:
source: somewhere in DRAM (phrase aligned)
destination: $F03000 (G_RAM)
Result: 7 cycles per phrase -> 4096*7/8 = 3584 cycles for 4K

Fast DRAM->GPU transfer in phrase mode:
source: somewhere in DRAM (phrase aligned)
destination: $F03000+$8000 (G_RAM+$8000)
Result: 5 cycles per phrase -> 4096*5/8 = 2560 cycles for 4K

-----------------------------------------------

More information to come in the future.
If you have a specific timing you would like measured, I can help.
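The cycle totals in the post above all follow from one formula: bytes transferred, divided by 8 bytes per phrase, times the measured cycles per phrase. Here is a minimal Python sketch of that arithmetic. The function and dictionary names are illustrative and do not come from the post; only the per-phrase figures are SCPCD's measurements, and the B_COUNT decoding assumes the outer count sits in the high word and the inner count in the low word, as the "$0001 * $0400" breakdown suggests.

```python
PHRASE_BYTES = 8  # one phrase is 64 bits

# Cycles per phrase as reported in the post (phrase mode).
MEASURED_CYCLES_PER_PHRASE = {
    "GPU->DRAM": 11,
    "DRAM->GPU": 7,
    "DRAM->GPU (+$8000)": 5,
}


def blit_bytes(b_count, bytes_per_pixel):
    """Size of a blit from a B_COUNT value (assumed: high word = lines, low word = pixels)."""
    lines = (b_count >> 16) & 0xFFFF
    pixels = b_count & 0xFFFF
    return lines * pixels * bytes_per_pixel


def total_cycles(n_bytes, cycles_per_phrase):
    """Total cycles for a phrase-mode blit of n_bytes (must be a whole number of phrases)."""
    if n_bytes % PHRASE_BYTES:
        raise ValueError("size must be a multiple of 8 bytes")
    return n_bytes // PHRASE_BYTES * cycles_per_phrase


size = blit_bytes(0x00010400, 4)  # PIXEL32 means 4 bytes per pixel
for name, cycles in MEASURED_CYCLES_PER_PHRASE.items():
    print(f"{name}: {total_cycles(size, cycles)} cycles for {size} bytes")
```

The same helper can be reused for any other per-phrase figure measured later in the thread.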
SebRmv Posted August 13, 2007

Interesting! And what about pixel mode?

In phrase mode, is the result the same when using 8 bpp, 16 bpp or 32 bpp?
SCPCD Posted August 13, 2007

I will try soon.
SCPCD Posted August 15, 2007

> Interesting! And what about pixel mode?

In pixel mode the cycle time per access is the same, but each access transfers one pixel instead of one phrase. For each phrase of data:
- in PIXEL1 there are 64 DRAM reads
- in PIXEL2 there are 32 DRAM reads
- in PIXEL4 there are 16 DRAM reads
- in PIXEL8 there are 8 DRAM reads
- in PIXEL16 there are 4 DRAM reads
- in PIXEL32 there are 2 DRAM reads

The number of cycles for each read/write is exactly the same as before, depending on the source and destination. So for PIXEL16, pixel mode is 4 times slower than phrase mode.

> In phrase mode, is the result the same when using 8 bpp, 16 bpp or 32 bpp?

Yes, it's exactly the same result. (I haven't verified that the data is correct for modes below 32 bpp from and to the GPU.)
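The list above can be restated as a rule: in pixel mode the number of accesses per phrase is 64 divided by the pixel depth, and each access costs the same cycles as one phrase-mode access. A small Python sketch of that rule follows; the function names are hypothetical and the per-access costs passed in are the figures measured earlier in the thread.

```python
PHRASE_BITS = 64
VALID_DEPTHS = (1, 2, 4, 8, 16, 32)  # PIXEL1 .. PIXEL32


def accesses_per_phrase(bits_per_pixel):
    """Pixel-mode accesses needed to move one 64-bit phrase."""
    if bits_per_pixel not in VALID_DEPTHS:
        raise ValueError(f"unsupported depth: {bits_per_pixel}")
    return PHRASE_BITS // bits_per_pixel


def pixel_mode_cycles_per_phrase(bits_per_pixel, cycles_per_access):
    """Cost of one phrase of data in pixel mode, given the per-access cost."""
    return accesses_per_phrase(bits_per_pixel) * cycles_per_access


for depth in VALID_DEPTHS:
    print(f"PIXEL{depth}: {accesses_per_phrase(depth)} accesses per phrase")

# PIXEL16 with the 7-cycle DRAM->GPU access measured in phrase mode
print(pixel_mode_cycles_per_phrase(16, 7), "cycles per phrase")
```

This is only a model of the statement in the post; it assumes the per-access cost really is independent of pixel depth, which is what SCPCD reports.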
Symmetry of TNG Posted August 18, 2007

Nice!... =)

So.. you are saying that it's slower to copy data FROM GPU mem to DRAM (5632) than it is to copy TO GPU mem (3584)? Any ideas why? I don't really understand that from a logical point of view. Do the DRAM timings differ for read vs write? SRAM should have the same read and write time, right? How about DRAM? Or is it an OP stall, a page miss on writes to DRAM, or what? Why is it slower to copy TO DRAM than it is to copy from it?

What is the GPU doing in your example code? I'm just curious because I have always wondered what happens when you start a GPU->DRAM blit and continue running GPU code afterwards... I mean, logically someone would have to halt to give the blitter access to GPU mem, right? I.e. a GPU->DRAM blit would halt GPU execution and parallelism would be lost, or? Can this be verified with your LA? I.e. faster blits from GPU mem if the GPU is turned off. (Or perhaps it's just a stall issue: once the blit starts, it finishes in the same time?!)

It's nice to see that the +$8000 is really faster, because in my experiments (IIRC the tunnel code) it had no effect at all.. not something that could be seen in the fps count anyway (or in the "raster bar timings", they were the same size).

Last question: when I did my own timing test of the memory (simple color "rasters" on screen), the DRAM->DRAM blit code was as fast as anything that had to do with the GPU memory! I assume src and dest were within the same page in DRAM, hence the speed.. But it would be nice to see such a timing, i.e. a blitter phrase blit of DRAM->DRAM, preferably within the same page. Of course it would also be nice to see what happens if the blit crosses a page boundary so there is a page miss on each phrase.. (then perhaps we will see my point on why cleverly written GPU sdram code would be faster than the "running from main RAM" thingy..)

Perhaps this is because DRAM->DRAM will be true 64 bits, while DRAM->GPU mem would have to be 2*32 bits.. I.e. keeping the page boundaries in mind when you design your code will give you the ultimate power of the Jaguar!

Sorry for all the questions, but timing issues are interesting from an optimisation point of view.

Nice work!

cheers
/Sym
SCPCD Posted August 20, 2007

> So.. you are saying that it's slower to copy data FROM GPU mem to DRAM (5632) than it is to copy TO GPU mem (3584)? Any ideas why? [...] Why is it slower to copy TO DRAM than it is to copy from it?

I think it's a pipelining effect that makes a read and a write take different times: a read from a slow memory into a fast memory is easily pipelined. In the opposite direction it is not so easy to avoid losing cycles.

> What is the GPU doing in your example code?

The GPU is not running in my example. I'll try soon with code running on it.

> I'm just curious because I have always wondered what happens when you start a GPU->DRAM blit and continue running GPU code afterwards... [...] Can this be verified with your LA? I.e. faster blits from GPU mem if the GPU is turned off.

When the GPU executes code from internal memory, it uses at most 50% of the internal memory bandwidth (1 instruction per cycle and a prefetch of 2 instructions -> 1 memory access every 2 cycles). So blitting while the GPU runs at its highest speed reduces the blit speed. I can verify it with the LA.

> Last question: when I did my own timing test of the memory (simple color "rasters" on screen), the DRAM->DRAM blit code was as fast as anything that had to do with the GPU memory! I assume src and dest were within the same page in DRAM, hence the speed.. But it would be nice to see such a timing...

I will do it someday.

> Sorry for all the questions, but timing issues are interesting from an optimisation point of view.

Exactly!

> Nice work!

Thanks
SebRmv Posted February 2, 2008

OK, not really related, but I have a question about the blitter: are the registers double buffered? (This would mean that we can prepare the next blit while the current one is being completed.)

If yes, which ones exactly?
SCPCD Posted February 2, 2008

I'm not sure, but I think that none of the blitter's registers are double buffered.
[edit: "The data registers may only be written to while the Blitter is idle." page 70/141 of the Jag_v8 documentation]

There is double buffering (for some registers) in the Jag2 blitter.
SebRmv Posted February 3, 2008

> [edit: "The data registers may only be written to while the Blitter is idle." page 70/141 of the Jag_v8 documentation]

Yes, I have seen that in the doc as well. So this could mean that some of the registers are double buffered (in particular, almost all of the address registers).

Has anybody tested this?
SebRmv Posted May 14, 2008

Assuming that the resolution is DEPTH16.

What about DRAM->DRAM in PIXEL mode?

In particular, I wonder whether

DRAM->DRAM in PIXEL mode

is faster or slower than

DRAM->GPU (fast access) in PIXEL mode (assuming that it is correctly aligned to work; edit: by the way, does fast access work in pixel mode with DEPTH16?)
GPU (fast access) -> DRAM in PHRASE mode

i.e. we blit in two passes, using the GPU RAM as an intermediate buffer.

Thanks

edit: and thus, what about normal (slow) access:

DRAM -> GPU in PIXEL mode: 7 * 4 = 28 cycles/phrase, if I have understood the above correctly
GPU -> DRAM in PHRASE mode: 11 cycles/phrase according to the benchmark above

so that would be 39 cycles/phrase

and, depending on the answer to the question:

DRAM -> GPU in PIXEL mode: 7 * 4 = 28 cycles/phrase, if I have understood the above correctly
GPU (fast access) -> DRAM in PHRASE mode
SCPCD Posted May 14, 2008

I will run the benchmark tonight.

But don't forget that you cannot read at GPU_RAM+$8000; it is write-only.
("To allow faster transfers into the GPU space, all the registers are also available as thirty-two bit memory, at an offset of 8000 hex from their normal addresses. At this address, the internal memory is write only." p43/141)
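The write-only restriction quoted above can be captured as a simple address check done before a blit is set up. The Python sketch below is only an illustration: the helper names are hypothetical, G_RAM = $F03000 and the $8000 offset come from the thread, and the 4 KB size of GPU RAM is an assumption made here to bound the range.

```python
G_RAM = 0xF03000       # GPU local RAM base, as given earlier in the thread
G_RAM_SIZE = 0x1000    # assumption: 4 KB of GPU local RAM
FAST_OFFSET = 0x8000   # offset of the 32-bit write-only alias quoted from the documentation


def fast_write_alias(addr):
    """Map a GPU RAM address to its fast, write-only alias."""
    if not G_RAM <= addr < G_RAM + G_RAM_SIZE:
        raise ValueError(f"${addr:06X} is not inside GPU RAM")
    return addr + FAST_OFFSET


def is_fast_alias(addr):
    """True when addr lies inside the write-only alias of GPU RAM."""
    base = G_RAM + FAST_OFFSET
    return base <= addr < base + G_RAM_SIZE


def check_blit_addresses(source, destination):
    """Reject a blit that would read from the write-only alias."""
    if is_fast_alias(source):
        raise ValueError("the +$8000 alias is write-only and cannot be a blit source")
    return source, destination


print(f"${fast_write_alias(G_RAM):06X}")  # alias of the first GPU RAM address
```

In other words, the fast alias is only usable as a destination, so a GPU->DRAM blit has to read from the normal G_RAM addresses.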
SebRmv Posted May 14, 2008

> I will run the benchmark tonight.
> But don't forget that you cannot read at GPU_RAM+$8000; it is write-only.
> ("To allow faster transfers into the GPU space, all the registers are also available as thirty-two bit memory, at an offset of 8000 hex from their normal addresses. At this address, the internal memory is write only." p43/141)

Thanks, I forgot this point.

So basically, I would like to compare

DRAM->DRAM in PIXEL mode (DEPTH16)

to

DRAM -> GPU in PIXEL mode (still DEPTH16)
GPU -> DRAM in PHRASE mode

(which is 39 cycles/phrase if I have understood correctly)
SCPCD Posted May 14, 2008

DRAM->DRAM in pixel mode:

    move.l #PITCH1|PIXEL16|WID128|XADDPIX,d0
    moveq #0,d1
    loooooop:
    move.l d0,A2_FLAGS
    move.l #$100000,A2_BASE        ; source in DRAM
    move.l d1,A2_PIXEL
    move.l d1,A2_STEP
    move.l d0,A1_FLAGS
    move.l #$80000,A1_BASE         ; destination in DRAM
    move.l d1,A1_PIXEL
    move.l d1,A1_FPIXEL
    move.l d1,A1_STEP
    move.l d1,A1_FSTEP
    move.l d1,A1_CLIP
    move.l d1,A1_INC
    move.l d1,A1_FINC
    move.l #$00010400,B_COUNT
    move.l #SRCEN|LFU_REPLACE,B_CMD
    bra.s loooooop

We have 11 cycles to copy 2 bytes, so about 44 cycles/phrase.

--------------------------------------------------------------------------------------------

DRAM->GPU (fast access) in pixel mode:

    move.l #PITCH1|PIXEL16|WID128|XADDPIX,d0
    moveq #0,d1
    loooooop:
    move.l d0,A2_FLAGS
    move.l #$100000,A2_BASE        ; source in DRAM
    move.l d1,A2_PIXEL
    move.l d1,A2_STEP
    move.l d0,A1_FLAGS
    move.l #G_RAM+$8000,A1_BASE    ; destination: fast write-only GPU RAM alias
    move.l d1,A1_PIXEL
    move.l d1,A1_FPIXEL
    move.l d1,A1_STEP
    move.l d1,A1_FSTEP
    move.l d1,A1_CLIP
    move.l d1,A1_INC
    move.l d1,A1_FINC
    move.l #$00010400,B_COUNT
    move.l #SRCEN|LFU_REPLACE,B_CMD
    bra.s loooooop

We have 5 cycles for 2 bytes, so about 20 cycles/phrase.

GPU->DRAM in phrase mode:

    move.l #PITCH1|PIXEL16|WID128|XADDPHR,d0
    moveq #0,d1
    loooooop:
    move.l d0,A2_FLAGS
    move.l #G_RAM,A2_BASE          ; source in GPU RAM
    move.l d1,A2_PIXEL
    move.l d1,A2_STEP
    move.l d0,A1_FLAGS
    move.l #$100000,A1_BASE        ; destination in DRAM
    move.l d1,A1_PIXEL
    move.l d1,A1_FPIXEL
    move.l d1,A1_STEP
    move.l d1,A1_FSTEP
    move.l d1,A1_CLIP
    move.l d1,A1_INC
    move.l d1,A1_FINC
    move.l #$00010400,B_COUNT
    move.l #SRCEN|LFU_REPLACE,B_CMD
    bra.s loooooop

We have 11 cycles/phrase.

For DRAM->GPU followed by GPU->DRAM we have about 20+11 = 31 cycles/phrase, which is less than the 44 cycles/phrase of the DRAM->DRAM version.
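To put the final comparison in numbers, here is a short Python sketch that totals both approaches for a 4096-byte copy. The three per-phrase costs are the figures from the post above; the names are illustrative, and the sketch deliberately ignores the cost of programming the blitter a second time, which the thread does not measure.

```python
PHRASE_BYTES = 8

# Measured per-phrase costs from the benchmark post (PIXEL16).
DIRECT_DRAM_TO_DRAM_PIXEL = 44   # DRAM->DRAM, pixel mode
DRAM_TO_GPU_FAST_PIXEL = 20      # DRAM->GPU at G_RAM+$8000, pixel mode
GPU_TO_DRAM_PHRASE = 11          # GPU->DRAM, phrase mode


def two_pass_cycles_per_phrase():
    """Cost per phrase when GPU RAM is used as an intermediate buffer."""
    return DRAM_TO_GPU_FAST_PIXEL + GPU_TO_DRAM_PHRASE


def cycles_for(n_bytes, cycles_per_phrase):
    """Total cycles for n_bytes at the given per-phrase cost."""
    return n_bytes // PHRASE_BYTES * cycles_per_phrase


size = 4096
direct = cycles_for(size, DIRECT_DRAM_TO_DRAM_PIXEL)
two_pass = cycles_for(size, two_pass_cycles_per_phrase())
print(f"direct:   {direct} cycles")
print(f"two-pass: {two_pass} cycles")
print(f"saved:    {direct - two_pass} cycles")
```

Under these assumptions the two-pass route costs roughly 70% of the direct pixel-mode copy.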