Author Topic: Data Bandwidth Issues  (Read 1936 times)

Offline Hydrophilic

  • 128D user
  • *******
  • Posts: 1214
  • Age: 41
  • Location: Earth... still!
  • Reputation: 232
  • Gender: Male
  • With us since: 25/01/2007
    • H2Obsesson
Data Bandwidth Issues
« on: May 08, 2011, 05:35 AM »
Warning: long technical post (but easier than QCD)
 
This happens to relate to C128 software, audio, video, and the fast serial bus, but hopefully we can discuss ideas about bandwidth and data throughput in general (as applicable to the C128).  So it doesn't have to deal with the serial bus or audio or video, just the idea of efficiently and reliably dealing with a continuous stream of data.
 
I've been testing Media Player 128 with the fast serial protocol, and I've got the IRQ routines working well.  The frame rate is really nice compared to the JiffyDOS version, but I don't think I'm making full use of the potential bandwidth capable with fast serial.
 
So let me describe what my software is doing and what I think the issues are.  Please post if you notice something I'm missing or have any ideas or links that might help.  Thanks!
 
There is the main IRQ running at approximately 7.8kHz.  Each IRQ will play a bit of audio and usually read a byte from the serial bus.  So to a first approximation, the bandwidth is 7.8k byte/s.  Here 'k' means 1000 and not 1024.
 
However, in the visible portion of the VIC screen, the VIC DMA (so-called bad lines) means there is not always enough time to read a byte.  So 1 in 4 IRQs will not get data while VIC is in the visible portion.  Also, no byte is read from the serial bus during the transition from 1MHz to 2MHz, or during the transition from 2MHz back to 1MHz.
 
So to a second approximation (taking the above factors into account) the data bandwidth is 6.66k byte/s.  But when the video is encoded at that bandwidth, playback tends to stall every 20 seconds or so because data is not actually getting delivered at that speed.
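
As a rough check on the two approximations above, here is a minimal Python sketch.  The 7.8 kHz IRQ rate, the 1-in-4 skip and the 6.66k byte/s figure come from the post; `visible_fraction` (the share of time spent in the visible area) is an assumed parameter rather than a measured value, and the two speed-transition IRQs are ignored.

```python
IRQ_RATE_HZ = 7800  # from the post: about 7.8 kHz, at best one byte per IRQ


def effective_rate(irq_rate_hz, visible_fraction, skip_ratio=0.25):
    """Bytes/s when `skip_ratio` of the IRQs in the visible area read no byte."""
    lost = visible_fraction * skip_ratio
    return irq_rate_hz * (1.0 - lost)


def implied_loss(irq_rate_hz, observed_rate):
    """Fraction of IRQs that deliver no byte, given an observed byte rate."""
    return 1.0 - observed_rate / irq_rate_hz


if __name__ == "__main__":
    # visible_fraction = 0.6 is an assumption for illustration only
    print(round(effective_rate(IRQ_RATE_HZ, 0.6)))            # 6630
    print(round(implied_loss(IRQ_RATE_HZ, 6660) * 100, 1))    # 14.6
```

With a visible fraction near 60% the model lands close to the quoted 6.66k byte/s, so the second approximation looks self-consistent under that assumption.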
 
Two reasons for this discrepancy seem obvious to me.  First there is delay by the drive between sectors.  On C1581 (emulated in VICE), there is about 256 microsecond delay between sectors on the same track, so there is a loss of about 2 byte transfers for each 254 bytes (on the same track).  Also there is 1 status byte transferred for each sector.  So that basically means 3 bytes 'lost'... or approximately 1% loss.
 
I haven't measured the delay for sectors not on the same track.  This is the main delay from the side of drive.  I don't think VICE emulates this delay accurately.  Even if it did, there would be variance with other devices like CMD-HD or uIEC.
 
A second source for reduced bandwidth is from the player software.  It has to temporarily suspend loading data at the end of each packet / start of next packet to calculate pointers.  This usually takes the time normally needed to load 1 to 3 bytes (depending on packet type and current CPU speed).  The packet sizes are usually a bit smaller than a sector... about 200 bytes on average.  So I guess overall this causes another 3% loss to bandwidth.
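
The two overheads described above can be put side by side with a small Python sketch.  The byte counts are the ones given in the post; treating each stall as a whole number of "lost" byte slots is a simplifying assumption.

```python
def overhead_fraction(lost_slots, payload_bytes):
    """Share of byte slots that carry no payload."""
    return lost_slots / (payload_bytes + lost_slots)


# Per sector: 254 data bytes, about 2 byte-times of inter-sector delay,
# plus 1 status byte (figures from the post).
sector_loss = overhead_fraction(3, 254)

# Per packet: about 200 bytes, with a stall worth 1 to 3 byte-times.
packet_loss_low = overhead_fraction(1, 200)
packet_loss_high = overhead_fraction(3, 200)

if __name__ == "__main__":
    print(f"sector overhead: {sector_loss:.1%}")
    print(f"packet overhead: {packet_loss_low:.1%} to {packet_loss_high:.1%}")
```

On these figures the per-sector loss is a little over 1% and the per-packet loss is roughly 0.5% to 1.5%, so the 3% guess looks pessimistic unless other stalls are being counted as well.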
 
I tried scaling back the bandwidth on the video encoder by 10%... now down to 6.00k byte/s.  But it doesn't solve the problem... video playback will still stutter.  It happens less frequently, but it still happens.  And it is always annoying when it happens.  When I say the video stutters, this is actually a symptom of the audio buffer being drained.
 
The stutter is rather predictable when playing back on VICE with C1581 emulation.  But on real NTSC C128 using uIEC, it *almost* never happens.  Some videos play fine, but most will experience one or two delays.  I imagine a real C1581 would stutter worse than the VICE emulation.  And I imagine a CMD-HD would be somewhere in the middle; that is to say more than SD2IEC device but less than C1581.
 
So one obvious solution is to reduce the encoded bandwidth even more, down to say 5.75k byte/s.  Another one would be to increase the size of the audio buffer.  Right now the audio buffer is 4 kiByte = 2.1 seconds of audio (using 2-bit audio).  I have another 4 kiByte that I was reserving for a second audio channel (stereo SID), so I could easily do this... but it seems to me 2.1s should be plenty to account for any delay by the disk drive...
 
It sounds like it might be a bug in the encoder, but I have several checks in there and haven't been able to track down a problem... I guess I could try changing the priority of the audio buffer versus the video frame... As I recall, it tries to keep the audio buffer at least 40% full which would be about 1s of audio.   I think 1 second should be plenty of time for a C1581 to move its head and load a track (assuming no read errors).
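
The buffer figures quoted above can be checked with a short Python sketch.  The buffer size, the 2-bit samples and the 40% threshold are from the post; one audio sample per 7.8 kHz IRQ is an inference from those numbers, not something stated outright.

```python
SAMPLE_RATE_HZ = 7800   # assumed: one 2-bit sample played per IRQ
BITS_PER_SAMPLE = 2
BUFFER_BYTES = 4 * 1024


def buffer_seconds(buffer_bytes, fill=1.0):
    """Playback time held in the buffer when it is `fill` (0..1) full."""
    samples = buffer_bytes * 8 // BITS_PER_SAMPLE
    return fill * samples / SAMPLE_RATE_HZ


if __name__ == "__main__":
    print(f"{buffer_seconds(BUFFER_BYTES):.2f} s when full")
    print(f"{buffer_seconds(BUFFER_BYTES, 0.4):.2f} s at the 40% threshold")
    print(f"{buffer_seconds(2 * BUFFER_BYTES):.2f} s with the second 4 KiB added")
```

At the 40% threshold the buffer holds closer to 0.84 s than a full second, which narrows the margin for a slow track change a little.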
 
However, it seems like I'm just not using the potential of the fast serial bus correctly.  After all, you can load data at sustained speeds over 6k byte/s with most any fast serial device (the C1571 might be an exception).  Of course that is just sitting in a polling loop (as opposed to loading during IRQ like I'm doing...)
 
Some other less promising ideas involve changing the IRQ routines.  Such changes could make the IRQ routine a bit faster, and since they are called very often, this would actually add a few thousand cycles to the main routines.  However, such changes would be a hassle and more importantly, I don't think it would make a huge difference in data throughput.  So like, it would mean the packet demuxer could run a bit faster, but it is only taking 2% to begin with... improving down to 1.9% doesn't seem like it would be worth the effort.
 
Just in case somebody thinks changing the IRQ routines might help, let me describe how it works...
 
The CIA#1 timer A is used to generate the IRQs.  The timer IRQ is cleared by simply reading status register at $dc0d.  This also contains the status bit of fast-serial byte received.  Because the software is not always ready to receive a byte (because of VIC DMA for example), the status bit must be saved during every IRQ.  So the IRQ routine does this...
 
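LDA $DC0D        ;read CIA#1 status (clears timer IRQ and serial flag)
ORA old_stat     ;merge in a flag saved by an earlier IRQ
AND #fast_serial ;keep only the byte-received bit
BNE read_byte    ;byte waiting, go read it
STA old_stat     ;save flag state (zero when BNE is live)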
 
'old_stat' is in zero page, so the ORA / STA instructions take 6 cycles total.  By the way, if the software isn't ready to read data, the BNE opcode gets changed to a dummy command, like CMP #nn.  Thus the fast-serial flag would be saved until the software can read the data.
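
The reason the flag has to be saved can be shown with a toy Python model of a read-to-clear status register.  `FakeCia` and `Handler` are hypothetical names for illustration; the mask $08 for the serial-port bit matches the `AND #8` used later in the thread, and clearing `old_stat` once a byte is consumed is an assumption about what `read_byte` does.

```python
FAST_SERIAL = 0x08  # serial-port (byte received) bit in the CIA status register
TIMER_A = 0x01      # timer A underflow bit


class FakeCia:
    """Toy model: reading the status register returns and clears all flags."""

    def __init__(self):
        self.pending = 0

    def raise_flag(self, bits):
        self.pending |= bits

    def read_icr(self):
        value, self.pending = self.pending, 0
        return value


class Handler:
    def __init__(self, cia):
        self.cia = cia
        self.old_stat = 0
        self.bytes_read = 0

    def irq(self, ready):
        status = (self.cia.read_icr() | self.old_stat) & FAST_SERIAL
        if status and ready:
            self.bytes_read += 1
            self.old_stat = 0        # assumed: flag dropped once the byte is read
        else:
            self.old_stat = status   # remember the byte for a later IRQ


if __name__ == "__main__":
    cia = FakeCia()
    h = Handler(cia)
    cia.raise_flag(TIMER_A | FAST_SERIAL)
    h.irq(ready=False)               # bad line: cannot read, flag is latched
    cia.raise_flag(TIMER_A)
    h.irq(ready=True)                # next IRQ picks the byte up
    print(h.bytes_read)
```

Without the `old_stat` latch the second IRQ would see only the timer flag and the byte would sit unread.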
 
One idea would be to have CIA#2 generate the timer interrupts.  Then we wouldn't have to save the fast-serial bit because we wouldn't even look at $dc0d when not ready to get data from the serial bus.  The main problem with the idea is that CIA#2 generates NMIs and not IRQs.  During the visible portion of the screen, interrupts can occasionally 'stack' on top of each other.
 
This stacking doesn't happen often, only when VIC DMA coincides with a page wrap of the audio buffer.  The important thing is it does happen.  It is not a problem with normal IRQs because they will wait, but I imagine it would be trouble with NMIs which never wait...
 
Anyway, when there is data available and we decide to (un)load it from the serial register, there are two DEC commands performed...
 
DEC byte_n_sector
BEQ read_status
DEC byte_n_packet
BEQ end_packet
 
That's a bit simplified because 'byte_n_packet' is actually a 16-bit number.  The 'byte_n_sector' tells us when the drive will send a status byte so we know not to put it in the audio or video buffer as normal data.
 
It should be noted this is wasting some time by doing 2 DEC for each byte loaded.  Remember, this should ideally happen every IRQ... thus by eliminating 7 cycles per IRQ, we're really saving about 54000 cycles per second...
 
As I said before, I don't think such a change would improve data throughput by a significant amount.  The change would require some slightly messy code.  Here's how I think it would go...
 
You'd have to compare 'byte_n_packet' and 'byte_n_sector' before starting to load a packet.  Put the smaller number in a variable 'byte_until_something' then decrement that 1 variable (thus only 1 DEC per IRQ).  When it hits zero,  check another variable that tells us if 'something' is 'end of sector' or 'end of packet'.  If not 'end of packet' then read status byte and re-calculate 'byte_until_something'.  Yeah, that's what I call messy... and I probably left something important out...
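
To see whether the single-counter idea loses anything, here is a Python sketch that compares it with the two-counter version on an abstract byte stream.  It assumes 254 data bytes per sector (as in the post) and models only where the 'status' and 'end_packet' events fall, not the real IRQ code; the function names are made up for illustration.

```python
SECTOR_DATA_BYTES = 254  # data bytes per sector, per the post


def events_two_counters(packet_sizes, sector_bytes=SECTOR_DATA_BYTES):
    """Reference: decrement both counters on every data byte."""
    events, pos, in_sector = [], 0, sector_bytes
    for size in packet_sizes:
        in_packet = size
        while in_packet:
            pos += 1
            in_sector -= 1
            in_packet -= 1
            if in_sector == 0:
                events.append((pos, "status"))
                in_sector = sector_bytes
            if in_packet == 0:
                events.append((pos, "end_packet"))
    return events


def events_one_counter(packet_sizes, sector_bytes=SECTOR_DATA_BYTES):
    """Proposed: count down only to the nearer of the two boundaries."""
    events, pos, in_sector = [], 0, sector_bytes
    for size in packet_sizes:
        in_packet = size
        while in_packet:
            until = min(in_sector, in_packet)  # 'byte_until_something'
            pos += until        # stands in for `until` single-DEC IRQs
            in_sector -= until
            in_packet -= until
            if in_sector == 0:
                events.append((pos, "status"))
                in_sector = sector_bytes
            if in_packet == 0:
                events.append((pos, "end_packet"))
    return events


if __name__ == "__main__":
    packets = [200, 200, 108, 254]
    assert events_one_counter(packets) == events_two_counters(packets)
    print(events_one_counter(packets))
```

The case the prose worries about, a sector boundary and a packet boundary falling on the same byte, shows up here at byte 508, where both events have to be handled in the same pass.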
 
I played around with some code to load data in a polling loop, but I didn't like how it was going... it was real messy with IRQs coming from CIA#1 (thus not very CPU efficient).  Might be better with CIA#2 assuming the NMI issue mentioned above doesn't crash the system... but either way, the software would have to toggle from normal video decoding into byte-polling mode and back...
 
This mode switching isn't a completely bad idea; once the main thread has decompressed a video frame, it doesn't have anything else to do except check the STOP key.  So switching to a polling routine seems natural at that point.  Maybe I should have another look at the idea...
 
Another idea would be to increase the average packet size.  Thus given the same amount of raw data, there would be less overhead.  Therefore more efficient use of bandwidth, but I don't think it would improve the bandwidth... which is what I would like to do if possible.
 
Also, the average packet size is just under 256 bytes because video packets get loaded into a page of memory for decompression.  This makes decompression fast by using absolute indexed addressing.  For example LDA $F000,X takes 4 or 5 cycles; as opposed to indirect indexed addressing like LDA ($FE),Y which takes 5 or 6 cycles.
 
That's not much of a difference for a single byte, but because a bitmap is about 2 kiByte compressed that amounts to about 2000 cycles per frame... Wait, the frame rate is only 2~5 fps anyway.... so maybe larger packets would be a good idea...
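
For scale, a short Python sketch of what the slower addressing mode would cost.  The 2 KiB per frame, the 1-cycle difference and the 2 to 5 fps range are from the post; comparing against a nominal 1 MHz clock is an assumption chosen as the worst case.

```python
def extra_cycles_per_second(bytes_per_frame, fps, extra_cycles_per_byte=1):
    """Added CPU cycles per second from a slower addressing mode."""
    return bytes_per_frame * fps * extra_cycles_per_byte


if __name__ == "__main__":
    for fps in (2, 5):
        extra = extra_cycles_per_second(2048, fps)
        share = extra / 1_000_000  # nominal 1 MHz, worst case
        print(f"{fps} fps: {extra} cycles/s, {share:.1%} of a 1 MHz CPU")
```

Even at 5 fps the penalty is about ten thousand cycles per second, roughly 1% of the CPU at 1 MHz.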
 
Well, that's about it for my ideas.  Please let me know of any ideas or experiences that could be helpful for efficient data throughput / maximizing data bandwidth.  Thanks again!
 
I'm kupo for kupo nuts!

Offline airship

  • 128D user
  • *******
  • Posts: 1605
  • Age: 61
  • Location: Iowa, USA
  • Country: us
  • Reputation: 113
  • Gender: Male
  • Former Editor, INFO Magazine
  • With us since: 28/07/2007
    • Atomic Airship
Re: Data Bandwidth Issues
« Reply #1 on: May 08, 2011, 07:05 AM »
Great info, as always. Wish I knew enough to help!

Here are some uninformed (i.e. stupid) thoughts:

For 1581 access, have you tried setting sector interleave to 1?
Are you using fast serial or burst-mode commands?
« Last Edit: May 08, 2011, 07:26 AM by airship »
Serving up content-free posts on the Interwebs since 1983.
History of INFO Magazine

Offline Hydrophilic

Re: Data Bandwidth Issues
« Reply #2 on: May 08, 2011, 01:27 PM »
This particular program is using burst-load command to get data over the fast-serial bus. 
 
I never thought about changing the sector interleave on the C1581... I had to do that on the C1571 to get audio-only to work... (I never bothered to even try video with the C1571).  I was under the impression sector interleave would not matter on a C1581 because of the way it loads an entire track of data into its internal buffer.  But that is something simple enough to try...
 
Okay, I tried that.  Didn't help.
 
Any other ideas?  Something as simple to test as airship's idea would be nice  ;)

Offline airship

Re: Data Bandwidth Issues
« Reply #3 on: May 09, 2011, 01:42 AM »
Oh, I have LOTS more simple ideas!  ;D

Have you tried using a custom disk format?
   Would an MFM format be faster, since there's no GCR decoding involved?
   How about more (or fewer) bytes per sector?

I can keep 'em coming all day!

Offline BigDumbDinosaur

  • C128 user
  • ******
  • Posts: 757
  • Age: 67
  • Location: Midwest USA
  • Country: us
  • Reputation: 64
  • Gender: Male
  • Yuh think donkeys are dumb, try a politician!
  • With us since: 02/01/1970
    • BCS Technology Limited
Re: Data Bandwidth Issues
« Reply #4 on: May 09, 2011, 01:56 AM »
Quote from: Hydrophilic
The CIA#1 timer A is used to generate the IRQs.
Are you aware of the timer A IRQ bug present in many 6526s?  Dunno if that would have a bearing on performance, but it might be worth considering.
x86?  We don't got no x86.  We don't NEED no stinking x86!

Offline Hydrophilic

Re: Data Bandwidth Issues
« Reply #5 on: May 12, 2011, 08:21 PM »
Airship, the C1581 only does MFM as far as I know (which is what I'm testing).  Now the C1571 does both GCR and MFM but, as alluded to above, the 1571 doesn't have enough capacity for me to warrant development effort.  As for customization, see below (part 5)...
 
BDD, is there some specific bug that C128 programmers should be aware of?  I've heard about the TOD/Alarm bug... but you have something else in mind?  Considering your work on the dual port serial adapter, I thought you might have some suggestions / preferences on using IRQ versus NMI.  As I understand it, your adapter runs at about 38k (bits), which is about 4.75k byte/s and comparable to the 7.8kHz (bytes/s) interrupt rate I'm using.  So maybe you know some efficient techniques for managing that data flow?  I also understand it uses an ACIA not a CIA, but I would think there are some similarities in programming...
 
I spent some time working on this... actually too much time!  I've put as much effort into the fast-serial routines as the JiffyDOS routines before... even though the hardware (fast-serial) should be much simpler than the software method (JiffyDOS)...
 
Ready for a sad story?  Too bad, cuz here it is!  In five parts:
 
Part 1. Investigation A
I confirmed I was getting approximately 6.66 kByte/s transfer average during sectors of a track.  The actual value was a tad less, 6.60 kByte/s.  So the track-to-track delay (mainly) and my own code occasionally pausing to process data (minor factor) seem to drag the average down to about 5.00 kByte/s.
 
Part 2. Double IRQ rate
I spent some time (too much time) re-writing the IRQ routines for the 2MHz (blank) portion of the video screen.  So when the CPU switches to 2MHz (25% to 40% of the time, depending on video height and NTSC/PAL standard) and if there is no video decompression active (roughly 50% of the time), the IRQ routine would occur at twice the frequency.  Half the time it would play audio (thus the audio rate is constant), and every time it would try to load data.  And what type of data increase did I get?  Less than 100 bytes/second ... less than 1% improvement!  >:(
 
Now the C1571/1581 can't send data faster than 1 byte / raster, which is about 1 byte per 64 microseconds or about 15.75 kHz.  So I wasn't expecting a full 15.75 kHz data rate... but I was expecting a bit more than 7.8kHz... and significantly more than 6.66kHz...  Unfortunately it stayed around 6.60kHz...
 
It seemed the C1581 was synchronized to my C128 and it would pick 1 of the 2 IRQs to send data.  In other words, the C1581 would transmit every other IRQ -- truly 50% of the time.  I was hoping it would transmit ASAP, which would be less than 100% (because it can't send at 15.75kHz) but more than 50%.
 
So that was a complete failure.  Double the IRQ rate and the device sends data 50% of the time... no improvement!  :'(
 
So then I thought, during the 50% that the player is not processing audio, I would put the IRQ into a polling loop.  In other words, stretch out half the IRQs in an attempt to get a byte an extra 50% of the time.
 
After working on that, the results were a very modest 2% improvement.
 
So to summarize, I was expecting data rate to go from 1 byte/IRQ to almost 2 byte / double_IRQ, but instead it only went up to 1.05 byte / double_IRQ.
 
Part 3. Polling
Annoyed by the lack of progress, I next tried keeping the IRQ rate the same, but instead having the code loop for a new byte until the end of the IRQ period.  In other words, the IRQ routine is hogging 90~95% of the CPU.  Which is okay because this only happens when the video decompression buffer is empty (the main thread is just sitting and waiting).
 
This made a small but noticeable improvement of about 25%.  But the average data rate was still well below 6.0 kByte/s (remember the 25% improvement only applies some of the time). 
 
Part 4. Early Polling (cheating)
I know from experience the C1571/81 have much higher peak data rates.  The problem is I was only in a polling loop some of the time...  got to break the loop so audio can play at some point.  So next I tried adding a 'byte-request' signal to the front of the audio IRQ.  So here is what it does:

;interrupt
PHA
LDA $DC0D ;CIA interrupt status (clear IRQ)+ serial-byte-ready
ORA c3po ;prior byte-ready
AND #8    ;check byte-ready
STA c3po ;save status
BEQ audio ;no byte ready
LDA #$10  ;CLK output bit
EOR $DD00 ;toggle serial line
STA $DD00 ; (request next byte)
audio:
...
LDA $DC0C ;read old byte from fast-serial bus
...

So what this does is it checks if we received a byte already.  If we have, we request another (new) byte and, importantly, we do nothing with the current (old) byte...  we leave the current byte in $dc0c and proceed to play some audio or some other thing.  Later we read the old byte from $dc0c, but we have to be careful to read it before the new byte (the one just requested) arrives and destroys the old byte.
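
The timing constraint described above can be written down as a one-line check.  This Python sketch assumes the next byte cannot arrive sooner than about 64 microseconds after the request (the per-raster limit mentioned earlier in the thread); `stolen_cycles` is a hypothetical allowance for cycles lost to VIC DMA, and the function name is made up for illustration.

```python
def read_is_safe(cycles_to_read, cpu_mhz, min_byte_time_us=64.0, stolen_cycles=0):
    """True if the old byte is read before the next byte can have arrived.

    cycles_to_read: CPU cycles between toggling CLK and reading $DC0C.
    stolen_cycles:  extra cycles lost to DMA in that window (assumed input).
    """
    elapsed_us = (cycles_to_read + stolen_cycles) / cpu_mhz
    return elapsed_us < min_byte_time_us


if __name__ == "__main__":
    print(read_is_safe(50, 1.0))                    # 50 us at 1 MHz
    print(read_is_safe(50, 1.0, stolen_cycles=20))  # 70 us, too late
    print(read_is_safe(100, 2.0))                   # 50 us at 2 MHz
```

A faster device shrinks `min_byte_time_us` and with it the safe window, which is why the scheme depends on the device not sending faster than the code can read.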
 
It's a dangerous game.  But as long as you can count CPU cycles and the device doesn't transmit data faster than your code can get to it, it actually works.  It works much better than the other methods.  Data rate of 8.19kByte/s.  Now that's some progress!
 
Part 5. Investigation B / Bad ROM
Now the peak rate is much higher than before (8.2k versus 6.6k).  Unfortunately the average is quite a bit lower.  Just under 6.0 kByte/s.  Oh well, guess there isn't much you can do about head-movement and disk-spinning...
 
The important thing for me is why the peak transfer was so much less than 15 kHz, considering the C1571/81 transfer data on the fast serial bus at about that speed (about 64 microseconds/byte) via hardware.
 
So I took a look at the ROM in the 1581.  And the problem is that its transfer routine is badly inefficient!  So I thought, "Commodore has never been famous for fast serial routines, let's try JiffyDOS".  Sounds like a good idea, huh?  Just as bad!  From what I can tell, the fast-serial hardware, as used by Commodore and JiffyDOS, is slower than the JiffyDOS software method!
 
So here is a comparison, first JiffyDOS (at 1MHz)

;repeat the following 4 times to get 4 bit-pairs (8 bits total)
LDA $DD00 ;4 cycles
LSR        ;2 cycles
LSR        ;2 cycles
NOP       ;2 cycles
;subtotal = 10*4 = 40 cycles
;save byte, synchronize, and loop ~ 24 cycles
;grand total = 64 microseconds

So the software method takes about 64 cycles which translates to about 15.75 kHz.  Now look at how the C1571/81 ROM does it... and JiffyDOS ROM does it too... Note the C1571/81 operate at 2MHz...

STX $400C ;4 cycles -- send data on fast serial bus
LDA flipper  ;0 cycles (occurs during transmit)
EOR #CLK   ;0
STA flipper  ;0
LDA #8      ;0 cycles -- test fast serial complete
wait
BIT $400D  ;64*2 cycles -- wait serial byte transmitted
BNE wait    ;3 cycles
RTS          ;6 cycles
####       ;~12 cycles -- read next byte and loop control
JSR transfer ;6 cycles
transfer
LDA bus    ;4 cycles
CMP bus    ;4 cycles
BNE transfer ;2 cycles (assume no loop -- stable)
AND flipper  ;3 cycles
BEQ transfer ;2 cycles (assume no loop -- handshake received)
;total 174 cycles = 87 microseconds

So the hardware method takes about 87 microseconds or only 11.5kHz.  I don't know why anyone would code a transfer like that.  Maybe the CIA bug mentioned by BDD?  All I know is that it is slower than both the maximum possible (according to the CIA hardware timer) and, more importantly, slower than software clocking by JiffyDOS.  I expect crap like this from Commodore disk drives... but I was truly shocked to find it in JiffyDOS ROM as well... sad...
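
The two cycle counts can be turned into byte rates with a few lines of Python.  The per-step cycle counts are copied from the two listings above; the clocks are taken as exactly 1 MHz and 2 MHz, which is an approximation of the real crystal frequencies.

```python
# JiffyDOS software transfer: LDA, LSR, LSR, NOP per bit-pair, four pairs,
# plus about 24 cycles to save the byte, synchronize and loop.
JIFFY_PAIR_CYCLES = 4 + 2 + 2 + 2
JIFFY_CYCLES = 4 * JIFFY_PAIR_CYCLES + 24

# Drive-side fast-serial loop, step by step in listing order.
DRIVE_CYCLES = sum([4, 64 * 2, 3, 6, 12, 6, 4, 4, 2, 3, 2])


def byte_time_us(cycles, clock_mhz):
    """Microseconds per byte for a loop of `cycles` at `clock_mhz`."""
    return cycles / clock_mhz


def byte_rate_khz(cycles, clock_mhz):
    """Bytes per millisecond, i.e. the byte rate in kHz."""
    return 1000.0 / byte_time_us(cycles, clock_mhz)


if __name__ == "__main__":
    print(JIFFY_CYCLES, byte_time_us(JIFFY_CYCLES, 1.0),
          round(byte_rate_khz(JIFFY_CYCLES, 1.0), 2))
    print(DRIVE_CYCLES, byte_time_us(DRIVE_CYCLES, 2.0),
          round(byte_rate_khz(DRIVE_CYCLES, 2.0), 2))
```

With nominal clocks the software loop comes out at 64 microseconds per byte and the drive ROM loop at 87 microseconds, about 11.5 kHz.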
 
Also note that fast-serial code (the last example above) is from JiffyDOS ROM.  The official CBM ROM is similar, only worse (more JSRs and no zero-cycle instructions).  Oh yeah, the 'zero-cycle instructions' are executed in parallel with the fast-serial transfer so they don't count towards total time.
 
*Conclusion*
Now videos are playing very well on my SD2IEC device.  Not so well on C1581 because of stupid Commodore and/or JiffyDOS ROMs.  I wonder how a CMD-HD would hold out?  Anyway, one way to fix the problem is to burn a custom ROM without stupidity built into the transfer routines.
 
I'm still open to other options...
 

Offline BigDumbDinosaur

Re: Data Bandwidth Issues
« Reply #6 on: May 13, 2011, 03:10 AM »
Quote from: Hydrophilic
BDD, is there some specific bug that C128 programmers should be aware of?  I've heard about the TOD/Alarm bug... but you have something else in mind?
I incorrectly referred to the problem as the timer-A bug; it actually affects timer-B.  The specific problem is if the interrupt status register (ISR) is read one or two Ø2 cycles before a timer-B interrupt is scheduled to occur, the relevant bit in the ISR will not be set and the interrupt will not occur.  Not all 6526s will exhibit this problem.  However, George Hug, a contributor to Transactor, noted that most 6526s displayed this problem, and that it was one of the reasons why higher baud rates failed in the fake RS-232 routines in both the C-64 and C-128.  As far as I know, no such problem exists with timer-A.

Quote
Considering your work on [your] dual port serial adapter, I thought you might have some suggestions / preferences to using IRQ versus NMI.  As I understand your adapter runs about 38k (bits)which is about 4.75k byte/s...
Actually, it's 3.8 kByte/s, since in addition to the eight data bits there is a start and a stop bit, 10 bits total per byte transferred.  One interrupt is generated per byte transferred, so the IRQ rate is equal to the bit rate divided by 10.
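
The framing arithmetic is easy to check in Python.  The sketch assumes the "38k" in the quote means the standard 38,400 bit/s rate, which the thread does not state explicitly.

```python
def bytes_per_second(bits_per_second, data_bits=8, start_bits=1, stop_bits=1):
    """Payload byte rate of an async serial link with the given framing."""
    return bits_per_second / (data_bits + start_bits + stop_bits)


if __name__ == "__main__":
    print(bytes_per_second(38400))   # 3840.0 bytes/s, one interrupt per byte
    print(38400 / 8)                 # 4800.0, what ignoring framing would give
```

So the adapter's interrupt rate is about half of the 7.8 kHz rate used by the player.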
Quote
...and comparable to the 7.8kHz (bytes/s) interrupt rate I'm using.  So maybe you know some effecient techniques for managing that data flow ?  I also understand it uses an ACIA not a CIA, but I would think there's some similarities in programming...
The 8502's interrupt latency is the same for both IRQs and NMIs.  Depending on the instruction currently being executed, up to seven Ø2 cycles may elapse before a hardware interrupt is acknowledged.  Although  the MPU must test the I bit in the status register upon detection of an IRQ, the published specs don't indicate any difference in performance between the two interrupt types.

On the C-128, some interrupt processing overhead is present in the vectored code near the top of the kernel ROM and is executed before the jump is taken through the page three indirect vectors.  You can't avoid this overhead but you can arrange for your interrupt handler to be the immediate target of the indirect vector so your handler has highest priority.  If you determine that the IRQ is yours (i.e., timer-A underflow), finish your IRQ code by going directly to the CRTI common return at $FF33 in the kernel ROM, or duplicate that code in your handler to avoid the extra clock cycles of the jump to CRTI.

One thing that could be a performance factor is the processing of the IRQ routines in the BASIC interpreter.  Those routines are responsible for moving sprites, playing music and similar periodic activity, and they run with each jiffy IRQ.  If you are not using any of these features you can tell the IRQ subsystem to skip them by clearing bit zero of the INIT_STATUS flag at $00A04.  This should reduce some of the processing overhead.

Something else to look at is the structure of your interrupt handler.  You want to use as much linear code as possible.  Each JSR will cost you six clock cycles plus the six cycles required by RTS.  Look carefully at relative branches as well.  Try to arrange your code so jumping and branching occurs only for the least common case.  Also recall that a branch across a page boundary uses an extra clock cycle to compute the MSB of the branch target's address.  If possible, use the stack for temporary storage.  PHA-PLA uses one less clock cycle than STA ADDR-LDA ADDR, as well as less memory.

If you are using data tables and indexed addressing to access them, see if you can page-align the tables to avoid a cross-page indexing penalty.  Place the most often used data near the start of the table.  Speaking of data tables, where possible use them in place of run-time computed values.  Fewer instructions will have to be executed, ergo fewer clock cycles.

The basic problem you are up against is the relatively slow clock rate of the C-128, even in FAST mode.  The IRQ rate you are using means much of the 8502's time is being consumed in servicing IRQs.  Potentially, 20 clock cycles will be consumed per IRQ in MPU internal overhead, before your IRQ handler actually gets anything done.  This breaks down to up to seven clocks to account for interrupt latency, seven clocks in front end overhead (pushing the PC and SR to the stack, and loading the PC with the IRQ vector) and six more executing RTI, which reverses the stack pushes performed when the IRQ was initially processed.  Multiply that by your interrupt rate (7800) and you are consuming up to 156,000 clocks per second in MPU overhead, and that's before you actually do any useful work.  In other words, 7.8 percent of the clock cycles available in one second at 2 MHz is potentially consumed in internal MPU overhead.

To put that into perspective, the fastest MPU instructions require two clock cycles and, on average, you will consume four to five clock cycles with the most commonly used instructions.  Let's assume four for now.  At 2 MHz, you can theoretically execute 500,000 instructions per second, assuming no interrupts of any kind.  Now, let's subtract the MPU's internal IRQ processing overhead.  At an interrupt rate of 7800 per second, that leaves you with 461,000 instructions that can be executed per second, both for interrupt and foreground processing.  We haven't accounted for the initial kernel code executed at each interrupt (register pushes, changes to the MMU, testing the BRK bit, etc.), the code executed to pull registers and restore the memory map at the end of the interrupt handler, or the additional load of the 60 Hz jiffy IRQ, which potentially eats up another 1200 cycles each second in MPU internal overhead, as well as in executing the front and back end IRQ handler code.  I think you can see where the conversation is headed.

Since your code is timing-critical, you need to go through it top to bottom to weed out as much inefficiency as possible.  I think the 7800 IRQ rate is sustainable, but not unless your IRQ handler is very tight.

Quote
*Conclusion*
Now videos are playing very well on my SD2IEC device.  Not so well on C1581 because of stupid Commodore and/or JiffyDOS ROMs.  I wonder how a CMD-HD would hold out?  Anyway, one way to fix the problem is to burn a custom ROM without stupidity built into the transfer routines.
 
I'm still open to other options...
Based on my experience with the Lt. Kernal and direct access methods, it would easily sustain the required data rate.  Even back when it was using ST-412 drive mechanisms, it was possible to load a 512 byte block into RAM in about 30 milliseconds, worst case, an effective data transfer rate of 17K per second.  On average, transfer rates were much higher, sometimes approaching 65K per second.  The CMD drive was never able to reach that transfer rate.

Offline Hydrophilic

Re: Data Bandwidth Issues
« Reply #7 on: May 13, 2011, 11:20 PM »
Thanks for the detailed reply and especially info about the Timer B bug.  Fortunately I'm using Timer A for interrupts.  This is possible when reading fast serial data, but not when sending (CIA#1 Timer A clocks fast serial output). 
 
Fortunately for my purposes, I'm only reading fast serial data once the IRQ routine is activated.  I would be forced to use NMI (instead of IRQ) on CIA#2 if I ever needed to transmit data... or else risk the Timer B bug of CIA#1...
 
This may or may not be relevant to the Timer B bug... I have Timer B counting Timer A interrupts.  This is so each frame of video can be displayed at the correct time by comparing Timer B with a time-stamp encoded in the file (in the last packet of each frame).  I say it *can* but I haven't actually coded that part yet!  Currently it just displays the frame as soon as it is decoded. 
 
In other words, Timer B is counting IRQs of Timer A, but I'm not yet testing Timer B.  Timer B does not generate interrupts.  The only interest is the value in Timer B as measure of absolute time.  So I guess if strange things start happening then I'll have to switch to a software method.  Or, because video seems to almost never be more than 5 fps (0.2 seconds), I guess I could use the TOD of either CIA which has 0.1 second resolution...
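
Since the time-stamp comparison has not been coded yet, here is one way it might look, as a Python sketch only.  It assumes a 16-bit tick count that increases once per Timer A interrupt (a real CIA timer counts down, so the value would need inverting) and 16-bit time-stamps in the file; both are assumptions, and the function names are made up.

```python
COUNTER_BITS = 16
MODULO = 1 << COUNTER_BITS
IRQ_RATE_HZ = 7800  # from the thread


def ticks_until(now, stamp):
    """Signed distance from `now` to `stamp`, both taken modulo 2**16."""
    diff = (stamp - now) % MODULO
    return diff - MODULO if diff >= MODULO // 2 else diff


def frame_is_due(now, stamp):
    """True once the tick counter has reached or passed the time-stamp."""
    return ticks_until(now, stamp) <= 0


if __name__ == "__main__":
    print(f"counter wraps every {MODULO / IRQ_RATE_HZ:.1f} s")
    print(ticks_until(65530, 4))     # stamp lies just after the wrap
    print(frame_is_due(10, 5))       # stamp already passed
```

The signed comparison stays correct across a wrap as long as frames are never more than half a wrap (about 4.2 s) apart, which the 2 to 5 fps range comfortably satisfies.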
 
That's good info, BDD, about considering interrupt overhead.  I've kept this to a minimum (as best I can).  Here is how I'm doing it, for anybody who might find the info helpful or maybe somebody can suggest an idea for improvement...
 
The main thread of the program is ML and does not use KERNAL routines once playback starts.  So several cycles are saved by not using the KERNAL vectoring of IRQ.  So instead of changing the KERNAL vector at $314, the CPU vector at $FFFE is changed.  Let's compare!
 
KERNAL IRQ Vector Method (part 1)

;CPU interrupt - 7 cycles
PHA   ;3 cycles
TXA   ;2 cycles
PHA   ;3 cycles
TYA   ;2 cycles
PHA   ;3 cycles
LDA $FF00 ;4 cycles
PHA   ;3 cycles
LDA #0 ;2 cycles
STA $FF00 ;4 cycles
;test for BRK or real IRQ
TSX   ;2 cycles
LDA $105,X ;4 cycles
AND #$10 ;2 cycles
BEQ doIRQ ;3 cycles
doIRQ:
JMP ($314) ;5 cycles

That's 49 cycles (TXA and TYA take 2 cycles each) before the IRQ routine goes to work.  After it does its job, the exit routine at $FF33 is executed...
 
KERNAL IRQ/NMI Vector Method (part 2)

PLA   ;4 cycles
STA $FF00 ;4 cycles
PLA   ;4 cycles
TAY   ;2 cycles
PLA   ;4 cycles
TAX   ;2 cycles
PLA   ;4 cycles
RTI   ;6 cycles

That's 30 cycles until the originally running code can resume.  So the total overhead is 79 cycles for KERNAL vectoring of IRQ.
 
It should be pointed out that NMI uses the same 'resume' code immediately above.  The actual vectoring of NMI by the KERNAL is very similar to the first block of code, but it takes about 40 cycles instead of 49, mainly because it doesn't need to check the stack for BRK.  So the total overhead is 70 cycles for KERNAL vectoring of NMI.
 
So I guess that shows if you want to use the KERNAL, you should choose NMIs if possible; it will save you 79-70 = 9 cycles per interrupt.  Of course if your code isn't time-critical, then this difference may not be important.
 
To process interrupts with the CPU vector, ROM at the top of memory must be disabled.  Well, I guess another method would be to burn a custom ROM.  Anyway, this is how my code does it...
 
CPU Vector Method

;CPU interrupt - 7 cycles
PHA   ;3 cycles
; interrupt routine here
PLA   ;4 cycles
RTI   ;6 cycles

So the total interrupt overhead is normally 20 cycles.  On occasion, where the code is not re-entrant, this gets reduced to 19 cycles by saving .A into a zero page address; specifically, PHA/PLA takes 7 cycles, but STA z/LDA z takes only 6 cycles.  Either way, this is about 3 or 4 times faster than the KERNAL NMI or IRQ vectoring method.
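
To put the 20-cycle figure in perspective, here is a rough Python sketch (not part of Media Player 128) that tallies the entry/exit cycles of the CPU vector method and estimates what share of the CPU they consume at the roughly 7.8kHz interrupt rate mentioned at the start of the thread.  The function name and the exact 1,000,000 Hz clock are assumptions made for illustration.

```python
# Rough tally of interrupt overhead for the CPU vector method.
# Cycle counts are the ones quoted in the listing above.
CPU_VECTOR = {"interrupt sequence": 7, "PHA": 3, "PLA": 4, "RTI": 6}

def overhead_share(cycles_per_interrupt, irq_hz=7800, cpu_hz=1_000_000):
    """Fraction of CPU time spent just entering and leaving the handler."""
    return cycles_per_interrupt * irq_hz / cpu_hz

total = sum(CPU_VECTOR.values())          # 20 cycles
zp_variant = total - (3 + 4) + (3 + 3)    # STA z / LDA z instead of PHA / PLA

print(total, zp_variant)                  # 20 19
print(f"{overhead_share(total):.1%}")     # 15.6%
```

So even the lean version spends roughly one cycle in six on overhead when the CPU is at 1MHz, which is why every cycle shaved from the handler matters.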
 
Some other things should be pointed out regarding this CPU vectoring method (besides having ROM disabled).  The memory configuration should not change or at least always remain 'interrupt friendly'.  So the I/O registers must always remain visible and you would have to confine code to one bank of RAM or make provisions otherwise.  In my code, I have common RAM at the top of memory (where the interrupt routines and data buffers reside) so I can freely switch RAM banks in the main-line code.  An alternative would be to duplicate the code in both RAM banks and ensure the data used by the interrupt code was in common RAM.  However, this alternate method prohibits the interrupt routine from using self-modifying code.  Of course the simplest method would be to never change the memory configuration... but then you might as well just use a C64...
 
It should be obvious that a large part of the 'savings' using the CPU vector method is that X and Y are not saved and restored during each interrupt.  If you really need one or both, then you would have to add extra cycles to save and restore it/them.
 
It should also be obvious that a large savings of the CPU vectoring of IRQ is there is no test for BRK.  So you can't use BRK or else you would have to add 10 or 11 cycles worth of code.
 
For reading audio buffer and storing serial data, my interrupt code is using self-modifying code and absolute addressing.  Like this...

getA:
LDA $e000  ;4 cycles
... ;process audio
INC getA+1 ;6 cycles
BNE serial  ;3 cycles
;page wrap, 15 or 16 cycles
LDA getA+2
CMP # >buffer_limit -1
BCC setAh
LDA # >buffer_start -2
setAh:
ADC #1
STA getA+2
serial:
...

So accessing the audio buffer and bumping the pointer takes 13 cycles in the usual case.  On a page wrap, an extra 14 or 15 cycles are needed.  The code above states 15 or 16, but because the BNE is not taken, 1 is subtracted.  Anyway, for a page wrap, the shorter time is typical, but the longer time is needed for a buffer-wrap.  This wouldn't be necessary if the data could fit in a 256 byte buffer, but that is just WAY too small for audio buffering.
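
The page-wrap path relies on the carry left by CMP feeding into ADC, which is easy to misread.  Below is a small Python model of the pointer update, offered only as an illustration: it assumes the buffer is page-aligned, that buffer_limit is the first page past the buffer, and that the assembler expressions mean "high byte, then subtract".  The function name and the example pages $E0..$EF are made up for the demonstration.

```python
# Python model of the self-modified 16-bit read pointer (illustration only).
# lo/hi stand for the operand bytes at getA+1 and getA+2.
def bump_pointer(lo, hi, start_hi, limit_hi):
    """Advance the pointer by one byte; limit_hi is the first page NOT used."""
    lo = (lo + 1) & 0xFF                     # INC getA+1
    if lo != 0:                              # BNE serial (usual case)
        return lo, hi
    a = hi                                   # LDA getA+2
    carry = 1 if a >= limit_hi - 1 else 0    # CMP #>buffer_limit-1
    if carry:                                # BCC not taken: buffer wrap
        a = (start_hi - 2) & 0xFF            # LDA #>buffer_start-2
    a = (a + 1 + carry) & 0xFF               # ADC #1 also adds the carry
    return lo, a                             # STA getA+2

# Assumed example: buffer occupies pages $E0..$EF.
print(bump_pointer(0xFF, 0xE3, 0xE0, 0xF0))  # (0, 228) -> next page, $E4
print(bump_pointer(0xFF, 0xEF, 0xE0, 0xF0))  # (0, 224) -> wrapped to $E0
```

In the buffer-wrap case the carry is set, so loading start-2 and adding 1 plus carry lands exactly on the start page; that is why no CLC or SEC is needed.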
 
I tried using indexed direct (aaaa,X) and indirect indexed addressing ((z),Y) but although it may seem you could save 1 or 2 cycles, when you count the fact you have to save and restore an index register and load it with a meaningful value, it actually takes many more cycles!
 
The serial-data-store is very similar, but it doesn't have to deal with buffer-wraps.  So it is a bit faster when dealing with a page wrap.  This should be obvious but I'll post it anyway:

;A has data from serial bus
wrtS:
STA memory ;4 cycles
INC wrtS+1 ;6 cycles
BNE exit   ;3 cycles
INC wrtS+2 ;6 cycles
exit:

So it takes 13 cycles in the typical case or 18 cycles for a page wrap.
 
One idea to speed things up is to re-write the code so we branch only on the exceptional case and not in the usual case.  In the last example, this would mean locating 'exit' immediately after the branch instruction and changing BNE to BEQ.  However, that means you would have to duplicate the code following 'exit' or add another branch/jump instruction to the longer exception code (thus the exception code which is longer to begin with becomes even longer).  The extra time for the exception is generally okay because it only happens once every 256 bytes.
 
I've used that technique in a few places where the CPU is running at 2MHz.  But not when running at 1MHz; because the timing is so tight, I thought it was more important to reduce the worst-case time.
 
I'm kupo for kupo nuts!

Offline BigDumbDinosaur

Re: Data Bandwidth Issues
« Reply #8 on: May 14, 2011, 01:44 AM »
Given what you are trying to accomplish, using CIA #2 makes more sense.  Aside from the Restore key, CIA #2 is the only device in the computer that can generate an NMI.  If you use NMIs to drive your player, you not only get to skip the BRK bit test code at the start of the interrupt handler, you can make some assumptions about the source of the interrupt and thus minimize polling overhead.  That is, if only timer-A in CIA #2 has been configured to generate an NMI, just read CIA #2's ISR to clear the pending interrupt, ignore the flag bits (you already know which one is set) and move on.  You'd only have to test flag bits if you set up another interrupt source.

Incidentally, ganging timer-B to timer-A in itself causes no problems.  The trouble arises if both timers are set up to generate interrupts and happen to hit at about the same time.  Let's suppose timer-A interrupts.  Let's also suppose that by the time the MPU responds to the interrupt and the front end code has been executed, four Ø2 clock cycles remain before timer-B is scheduled to interrupt.  During those final clock cycles, you read the ISR.  That is when the timer-B bug may appear.  So timer-B continues to run but doesn't interrupt, and your program gets all out of whack because of the resulting timing error.

The obvious solution is to never use timer-B as an interrupt source if other sources are configured in the same CIA.  It'll complicate coding but not much you can do about it.
x86?  We don't got no x86.  We don't NEED no stinking x86!

Offline Hydrophilic

Re: Data Bandwidth Issues
« Reply #9 on: May 15, 2011, 11:40 AM »
It's good to know that simply running Timer B (no interrupts) is okay; thanks BDD!  I've been thinking for quite a while now about using CIA#2 as you suggested.  However, unlike CIA#1, CIA#2 generates NMI which, by definition, is non-maskable.
 
This wouldn't be a problem if my code could always execute before the next interrupt.  Unfortunately there is an exception: when a VIC DMA occurs (1 every 4 IRQs) and there is an audio buffer page-wrap (once every 4*256=1024 IRQs).  So if my math is correct, the time over-run occurs 1 / 4 / 1024 times... or 1 in 4096.  Not very often... but since the interrupt rate is about 7800 times per second, it is very likely to happen at least once per second.
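
As a quick sanity check of that estimate, the arithmetic can be written out in a few lines of Python.  The figures are the ones quoted above; treating the DMA and page-wrap events as independent is an assumption.

```python
# Back-of-envelope check of the overrun estimate, using the post's figures.
irq_hz = 7800             # approximate interrupt rate
dma_fraction = 1 / 4      # interrupts that coincide with a VIC DMA
wrap_fraction = 1 / 1024  # audio page-wraps, as estimated in the post

overruns_per_second = irq_hz * dma_fraction * wrap_fraction
print(round(overruns_per_second, 2))   # 1.9
```

So on average the slow path would be hit about twice per second, which is consistent with "at least once per second".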
 
So blindly switching the code to use CIA#2 would probably crash the machine in a second or two!  Actually, the problematic routine does not load any data from the serial bus nor write any data to memory; it only reads from the audio buffer.  Thus it probably wouldn't crash the system, but it would royally mess up the audio playback!
 
Anyway, here is a comparison of the two methods (note I haven't been brave enough to try method 2 / NMI yet).  I actually meant to put this in my previous post.
 
The important thing to know about method 1 is that the current CIA flags must be merged with old status flags but the CIA automatically clears its flags when you read $dc0d (or $dd0d).
 
Method 1 = Fast Serial + Timer A CIA#1

LDA c3po  ;old byte-ready status
ORA $DC0D ;clear IRQ and merge new byte-ready status
STA c3po  ;save current byte-ready status

That takes 10 cycles in the normal case (c3po is in zero page).  Sometimes the code is ready to process the serial bus data, in which case the STA is not immediately necessary; this saves 3 cycles in the short term, but eventually the byte must be reset.  So it is 10 cycles either way; just a question of now or later.
 
Method 2 = Fast Serial + Timer A CIA#2

BIT $DD0D ;clear NMI
LDA $DC0D ;get current byte-ready status

This takes 8 cycles, which is 2 cycles less or 20% faster than Method 1.  This is possible because there is no need to save the byte-ready status.  If for some reason the code isn't ready to process fast-serial data, the reference to $dc0d can be changed to another non-destructive memory location; in other words $DC0D would not be accessed and it will 'remember' the byte-ready status until the code is ready to process it.  Note this assumes self-modifying code is applicable (code in RAM); otherwise you would need to save the status into a variable and then it would be just like Method 1 !
 
I agree with BDD; this is the preferred method, assuming you can guarantee the code will complete before the next NMI.  Unfortunately for my application, this can't be guaranteed... so far... After re-examining my code (for the 7 thousandth time!) I think it may be possible to shuffle around the code to guarantee in-time completion during a VIC DMA + audio page-wrap.
 
Currently I'm using CIA#1 because it generates IRQs which can be blocked.  The blocking of the next IRQ is only temporary because as soon as the slow / not-in-time interrupt finishes, the next one occurs.  Of course it will be delayed by a few cycles.  According to my math, this is never more than 18 slow (1MHz) cycles, which should produce an audio distortion of 55 kHz or more (not audible).
 
The only way I know to block an NMI is by not clearing the NMI source.  So using CIA#2, I could postpone reading $DD0D... thus preventing an NMI from corrupting the audio code in the case where it is too slow.  But (this is very important), that means the next interrupt would be skipped.  This would result in an audio distortion of about 7.8kHz (very audible) and also throw off the synchronization of the video (because it is counting on the Timer A interrupts).
 
Right now I need to decide how/if I can re-arrange my code to guarantee in-time completion so that NMIs could be used.  But enough about me!
 
This thread relates to a current project of mine, only as a concrete source of examples.  I really hope it could have more wide-spread applicability.
 
For example, I have read many books and articles relating to interrupts and data buffering on the 6502 CPU, but they all seem to be geared to entry-level programmers.  I guess the authors figure it is important to teach how basic interrupt processing works; if the reader can get some basic code working, then that is all the author cares about...
 
I've found a few advanced examples in the Commodore Hacking series of e-zines, but they are mostly related to VIC-II tricks.  In other words, they relate to interrupts, but only in regards to special video effects, and not general purpose data processing issues.
 
OFF TOPIC -- C=Hacking is offline!  I think this is/was a great resource for Commodore enthusiasts.  It has for a long time been hosted over at zimmers, and the site still seems to be working... except for my/their link to C=Hacking articles.  Fortunately I have a few of them on my hard drive, but there are many I am missing.  If anybody has a working link, please reply... thanks!
 
ON TOPIC
 
Another good online resource for 6502 code is (duh) 6502.org.  The math routines, especially fast multiply and fast square root are pure genius!  Not so much info in the interrupt processing / data through-put department unfortunately.  They do have a few good articles relating to the 6522 VIA as in the VIC-20, C1541, and C1571.  There is also an article, Interrupt-Serviced 256-Byte Data Buffer by Lee Davison.
 
As mentioned in my previous post, 256 bytes is WAY too small for an audio buffer.  So I'm going to elaborate on 16-bit processing next.  But if anybody knows of some good on-line info about > 256 byte buffer processing on 6502 (or similar machines, like 6800 or Z80), please post some links... thanks!
 
Previously I posted how the 16-bit audio buffer pointer is incremented.  Now I present some multi-byte decrement routines.  The first should be familiar to the experienced 6502 programmer, so I'll just list it without much detail; ask if you have any questions!  All the following examples assume there is no extra cycle for the branch instruction; I think we should all know to add 1 cycle if a branch should cross a page boundary.
 
Traditional 16-bit Decrement

LDA countL ;3|4 cycles (2|3 bytes)
BNE noWrap ;2|3 cycles (2 bytes)
DEC countH ;5|6 cycles (2|3 bytes)
noWrap:
DEC countL ;5|6 cycles (2|3 bytes)

If 'count' is in zero page, the routine takes 11 cycles in the usual case or 15 cycles for a page-wrap (13 or 18 cycles if not zero-page).  It consumes 8 bytes (zero page) or 11 bytes (not zero-page) of code.  Either way, the A register gets destroyed!  If A is important, then you need to waste 4 ~ 8 cycles (or more) with 2, 4, or 6 bytes of code.
 
In my opinion, the only advantage to this method is there is no special "setup" required.  Just store the "natural" 16-bit value into 'count' and you're ready to loop!
 
One way to improve the cycle time is to take better advantage of the 6502 automatic Z-flag updates.  So now I present a faster method.  It seems like I may have come across this in code I've disassembled, although I don't think I've ever read about it...
 
Reverse 16-bit Decrement

INC countL ;5|6 cycles (2|3 bytes)
BNE noWrap ;2|3 cycles (2 bytes)
INC countH ;5|6 cycles (2|3 bytes)
noWrap:

If 'count' is in zero page, the routine takes 8 cycles in the usual case or 12 cycles for a page-wrap (9 or 14 cycles if not zero-page).  This 'main' part consumes 6 bytes (zero page) or 8 bytes (not zero-page) of code.  Either way, the A register is preserved!
 
Assuming you gain nothing by preserving the A register, this "reverse method" will still be about 27~31% faster than the "traditional method" in the usual case (3 of 11 cycles for zero page, 4 of 13 otherwise).  You gain around 50% speed savings if A is important (depending on how you save/restore the A register).
 
Put another way, each loop saves 3 or 4 cycles (zero page or not) in the usual case (no page wrap) as compared to the "traditional method".  We'll come back to this point shortly...
 
Now I know you're probably thinking "that's INCrement not DECrement".  And you would be correct based only on that code fragment.  But let me explain.  In order to get the speed savings, there is some setup required.  Basically you negate the 16-bit value before the loop so that the above INC opcodes are really DEC in disguise!
 
Setup Reverse 16-bit Decrement

SEC     ;2 cycles (1 byte) -- may sometimes be omitted
LDA #0  ;2 cycles (2 bytes)
SBC countL ;3|4 cycles (2|3 bytes)
STA countL ;3|4 cycles (2|3 bytes)
LDA #0  ;2 cycles (2 bytes)
SBC countH ;3|4 cycles (2|3 bytes)
STA countH ;3|4 cycles (2|3 bytes)

The SEC instruction may be omitted if you know the carry flag is already set.  Because I'm an optimist, let's pretend it can be omitted!  Note, if you know the carry flag is clear, you can change the first "LDA #0" into "LDA #1" and still get the benefit of omitting SEC.  I'm not an optimist for nothing!
 
Anyway, this setup requires 12 bytes of code (zero page 'count') or 16 bytes of code (not zero-page).  It also requires 16 or 20 cycles of CPU time (zero page or not).  Besides that, the Carry and Overflow flags will get changed; this may or may not be important... in fact, Carry will normally be clear after this setup (which can be useful to know for your code that follows this setup).
 
Because the main (loop) part of the code saves you 3 or 4 cycles, you will break even after about 5 loops (16/3 or 20/4).  You actually lose cycles if you loop fewer than 5 times.  The important thing is you save cycles, potentially many cycles, if you loop more than 5 times.  Considering it's a 16-bit counter, it should be presumed you will loop 256 or more times on average.  For 256 loops, you should save around 752 cycles (zero page) or 1004 cycles (not zero-page).
 
I think that is a pretty good deal!  The only drawback is the 12 or 16 bytes of setup code and the corresponding time of 16 or 20 CPU cycles (again, zero-page or not).
 
Can it get any better?
 
Here is the method I am using.  I won't dare say it is original... every time I think I've done something new, people always point out it has been done before.  I guess that is to be expected in a world of over 5 billion people!  But I will say I've never seen it mentioned in any books or articles.  And I've never seen it used in any of the Commodore programs I have disassembled (which would be quite a lot).
 
Hydro 16-bit Decrement

DEC countL ;5|6 cycles (2|3 bytes)
BNE noWrap ;2|3 cycles (2 bytes)
DEC countH ;5|6 cycles (2|3 bytes)
noWrap:

Just like the "reverse method", if 'count' is in zero page, the routine takes 8 cycles in the usual case or 12 cycles for a page-wrap (9 or 14 cycles if not zero-page).  This 'main' part consumes 6 bytes (zero page) or 8 bytes (not zero-page) of code.  Either way, the A register is preserved!
 
This is very similar to the "reverse method" except we really are using DEC opcodes.  The advantage is a much simpler setup routine...
 
Setup Hydro 16-bit Decrement

LDA countL ;3|4 cycles (2|3 bytes)
BEQ no_mangle ;2|3 cycles (2 bytes)
INC countH ;5|6 cycles (2|3 bytes)
no_mangle:

This setup code requires only 6 or 8 bytes and normally executes in 10 or 12 cycles (3+2+5 for zero page 'count', 4+2+6 otherwise, with the BEQ not taken).  Another important feature is the Carry and Overflow flags are preserved... this may or may not be important.
 
Like the "reverse method", the main (loop) part of the code saves you 3 or 4 cycles.  However this is better because you will break even after about 3 or 4 loops (as opposed to about 5).  And of course you benefit from setup code that is only half the size!  You get 1, and only 1, guess as to which method I prefer :)
 
Silly me, I almost forgot!  Although the traditional method will reliably decrement a 16-bit counter, it has no easy way to detect zero...  I think that is very important!!  I imagine you would need to waste at least 6 bytes and 8 cycles (or more) to accomplish such a "simple" task.
 
However, for both "Reverse Decrement" and "Hydro Decrement", it is very easy to detect zero... just add a 2-byte BEQ instruction at the end... sweet!
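
For anybody who wants to convince themselves that both tricks really count the right number of passes, here is a small Python model of the two loops, including the BEQ zero test placed after the high-byte update.  It is only a sketch: the function names are made up, and a starting count of 0 is left out because both routines would then run 65536 passes.

```python
# Python models of the two 16-bit countdown loops (illustration only).
def reverse_loop(count):
    """Negate the count up front, then INC until the 16-bit value hits 0."""
    neg = (-count) & 0xFFFF
    lo, hi = neg & 0xFF, neg >> 8
    passes = 0
    while True:
        passes += 1
        lo = (lo + 1) & 0xFF        # INC countL
        if lo != 0:                 # BNE noWrap
            continue
        hi = (hi + 1) & 0xFF        # INC countH
        if hi == 0:                 # BEQ done
            return passes

def hydro_loop(count):
    """Bump the high byte when the low byte is non-zero, then plain DECs."""
    lo, hi = count & 0xFF, count >> 8
    if lo != 0:                     # LDA countL / BEQ no_mangle
        hi = (hi + 1) & 0xFF        # INC countH
    passes = 0
    while True:
        passes += 1
        lo = (lo - 1) & 0xFF        # DEC countL
        if lo != 0:                 # BNE noWrap
            continue
        hi = (hi - 1) & 0xFF        # DEC countH
        if hi == 0:                 # BEQ done
            return passes

print(reverse_loop(0x1234), hydro_loop(0x1234))   # 4660 4660
```

Both models leave the loop after exactly 'count' passes for any count from 1 to 65535, so the extra BEQ is all that is needed to detect the end.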
 
Well, that's about all the code examples and links I have for today.
 
I'm kupo for kupo nuts!

Offline Hydrophilic

Re: Data Bandwidth Issues
« Reply #10 on: May 22, 2011, 05:15 PM »
Previously, I had a peak data rate of about 6.6 kByte/s.  I did some more work on the IRQ routines, and now have the peak data rate up to 8.2 kByte/s.  Unfortunately the average data rate is only around 6 kByte/s.  This ends up producing a wide-screen (16:9) video with a rate of about 4 frames per second, or a full-screen (4:3) video with a rate of about 3 frames per second.  Really sad compared to modern equipment, but this fast-serial version is about 2~3 times faster than the JiffyDOS version... so I'm pretty happy...
 
Previously the IRQ loader was doing this:

LDA $DC0C ;get data byte
;the next pair of DEC/BEQs take 14 cycles in the usual case
DEC packet_size_low
BEQ packet_wrap
DEC sector_remainder
BEQ get_status_byte
STA memory

 
As mentioned in an earlier post, this can be improved to

LDA $DC0C ;get data byte
;the following DEC/BEQ takes 7 cycles in the usual case
DEC normal_count
BEQ exception
STA memory

Well, I finally implemented that idea.  And as I guessed beforehand, the code for calculation of 'normal_count' is rather messy!  That set-up calculation (and the follow-up code 'exception') requires about 40 cycles and almost as many bytes of code.  Of course about a dozen bytes of code are saved by omitting one of the DEC/BEQ pairs (that code gets duplicated a couple of times).  Code space was already pretty tight, and it took quite some effort to make it all fit.
 
Anyway, it costs about an extra 40 cycles per 254-byte sector of data.  But it saves 7*255 = 1785 cycles.  So it's a pretty good deal, assuming you have the memory needed for the extra code.
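
The general idea of folding two down-counters into one can be sketched in Python as follows.  This is NOT the player's actual setup code (which isn't shown in this thread); it is one plausible way to do it, and both function names and the example values are hypothetical.

```python
# Hypothetical sketch: replace two per-byte counters with one "bytes until
# the nearest exception" counter. Not taken from the player's source.
def next_normal_count(packet_left, sector_left):
    """Return (normal_count, exception to run when it reaches zero)."""
    if packet_left <= sector_left:          # ties go to the packet test,
        return packet_left, "packet_wrap"   # mirroring the old DEC/BEQ order
    return sector_left, "get_status_byte"

def after_exception(packet_left, sector_left, consumed):
    """Charge the bytes just consumed to both underlying counters."""
    return packet_left - consumed, sector_left - consumed

normal_count, kind = next_normal_count(300, 255)
print(normal_count, kind)                   # 255 get_status_byte
print(after_exception(300, 255, 255))       # (45, 0)
print(7 * 255 - 40)                         # 1745 cycles net saving per sector
```

The bookkeeping moves out of the per-byte path and into the rare exception path, which is where the roughly 40 extra cycles per sector come from.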
 
While testing that, I tried to get a better idea of the time delay between each sector transmitted by the C1581.  It seems to take about 3 or 4 IRQ periods.  This is about 4*2*64 = 512 slow (1MHz) cycles.  But the C1581 is running at 2MHz... so it is taking somewhere around 1024 cycles!!
 
Of course you expect a huge delay between sectors of a C1571, but because the C1581 buffers the entire track, it shouldn't take very long at all.  It definitely shouldn't take 1000 cycles...
 
So I was thinking, the C1581 probably runs with interrupts disabled while transmitting a sector and then re-enables IRQs when done (before switching to next sector buffer and sending new status byte).  I haven't examined the code of the C1581 in detail, but the C1571 works that way.  Assuming this is correct, and because it takes quite a while to transmit the full sector, then there would be an IRQ waiting to execute as soon as the sector finishes transmitting.  Now I can understand how running the IRQ routine and switching to a new sector could take around 1000 cycles... so now it makes sense...
 
These inner workings of the C1571/81 are not only academic.  There is a very practical benefit to gain by exploiting this knowledge!  It all comes down to timing of the status byte, as you may have guessed.
 
So previously my IRQ code was doing something like this

LDA $DC0C
DEC sector_remainder
BEQ status_byte
STA memory

Importantly, sector_remainder was initialized with size+1 (255 to be precise).  So 'size' bytes (254) would be stored to memory, and then on the next pass, the remaining +1 would DEC to zero.  Thus the branch would be taken, and importantly, the status byte would not be stored to memory (it gets tested).
 
Hopefully that is simple enough to understand.  The code was fairly simple, I think.  Anyway, to understand the benefit relating to status byte timing, you should also know that quite a bit of extra code needs to be run at the end of the sector (reception of status byte).
 
Because of the combination of 'packet_length' and 'status_byte' into new variable 'normal_count', the extra code can actually be quite a lot.  The important thing is it robs the CPU of the time it would otherwise use to load another 1 or 2 bytes.
 
So a few potential bytes get lost on a new sector / new packet which should be expected.  But if you look at that simple code and think about the timing of the C1581, you should see it results in 'the worst of both worlds'.
 
Let me explain in case it isn't obvious.  After sending 254 bytes of data, the C1581 will have a delay of about 6 bytes worth of time.  The sector_remainder in our IRQ code has a value of 1.  When the C1581 finally starts sending the next sector, starting with a status byte, that is when we DEC to zero and run the exception code which delays another 2 bytes worth of time.  That is, the delays are sequential and add up to about 8 bytes worth of 'wasted' time.
 
If we could optimize the code so that our 2 bytes of processing time occurs in parallel with the 6 bytes of C1581 processing time, then our 2 bytes would be hidden inside the C1581's 6 and we would have only 6 bytes of "waste" (as opposed to 6+2 = 8 waste).  Put another way, that should result in 8 wasted (old code) - 6 wasted (new code) = 2 bytes not wasted (new code).  Thus we 'gain' about 2 bytes for every 254 bytes.
 
That may sound like a good idea, and I did implement it.  But if you think about it, 2/254 is only about a 0.8% improvement.  And this calculation is only an approximation... I imagine the actual improvement could be even less.  I should have tested before and after...
 
Anyway, getting that extra fraction of a percent comes at a price, as you may have guessed!  This is discussed a bit further down.  But first I want to explain how I did it.  Maybe somebody can suggest a better idea...
 
Instead of putting size+1 into 'sector_remainder' and performing DEC before writing to memory, I put the actual size into 'sector_remainder' and performed DEC after writing to memory:

LDA $DC0C
STA memory
DEC sector_remainder
BEQ status_byte

So now at the end of the sector, the code for status_byte gets called.  This does all the extra processing that costs about 2 bytes worth of time.  Simultaneously the C1581 is doing its 6 bytes worth of next sector stuff.  Finally it transmits the status byte for the new sector which we quickly check and reset the sector_remainder and continue normal loading.
 
Now comparing before and after (the last two code blocks), it may not be obvious how there is a price to pay for the improvement.  But if you look again at the last code block (the "new and improved"), you should notice that the write to memory occurs before testing sector_remainder.  We'll corrupt memory by writing a status byte instead of file data unless this is accounted for!  Instead of using subroutines (which have a 12-cycle overhead every call), my code uses self-modifying in-line code.  The "price to pay" is changing the STA into BIT after reading the last byte, and then changing BIT back into STA after reading the status byte.  So there is 6 * 2 = 12 cycle overhead, but only for each sector, not every byte!
 
The above code was simplified for discussion.  In reality the memory pointer also gets updated at the same time.  The "improved" code requires special consideration for that as well.  An additional 20~30 cycles and 7 extra bytes of code.  Also, because of 1MHz and 2MHz code sets, the BIT/STA switch actually costs 10*2 = 20 cycles per sector and consumes 16 bytes of memory.  All said, the "price" is about 50 cycles and 23 bytes of memory.
 
In conclusion, the elimination of 7 cycles per every byte loaded, coupled with the (dubious) status-byte-synchronization has pushed the peak data rate up by about 24% (from 6.6 to 8.2 kByte/s).  Importantly, this has reduced the worst-case IRQ time to almost completely avoid stacking of interrupts.  In fact I went back and re-arranged the code so that I *think* it has been completely avoided.  I still need to double-check the 1MHz and 2MHz transition routines.  But I did check this code re-ordering does not reduce the peak data rate.
 
Anyway, the average data rate is now about 6.0 kByte/s and producing 4 fps for wide-screen video using a C1581.  Next I'm going to try using NMI; that should let me know pretty quick if there is any interrupt stacking by means of a system crash :)  This should squeeze out a few more percent of bandwidth.  Hopefully it will allow the C1581 to achieve 6.6 kByte/s average.  This is already exceeded with uIEC.  The reason for this wish is that you get about 5 fps with that bandwidth.  It may not sound like a lot, but you sure can see it!
I'm kupo for kupo nuts!

Offline Hydrophilic

Re: Data Bandwidth Issues
« Reply #11 on: May 31, 2011, 03:54 PM »
With the re-arranged code to avoid stacking of interrupts, I finally tried out using NMIs instead of IRQs.
 
As mentioned previously, this will usually save 2 cycles per byte loaded.  However, during the 2MHz polling routine, it actually adds 1 cycle to the first byte loaded in a loop.  Fortunately it works out that, on average, at least 1 cycle / byte is saved.
 
Long story made short, the max data rate increased from 8.2 kByte/s to 8.3 kByte/s.  I am underwhelmed.  It simplified the code in some cases and added a few bytes in other cases.  Overall, code size is about the same... probably a tiny bit smaller.
 
So if raw bandwidth were the only issue, I'd say go for it.  But there is another issue.  The RESTORE key.  This does not set any flags in the CIA#2 register.  So the only possible way to test for it is by testing the absence of Timer flag.  Well, this test (BPL) adds 2 cycles to every byte loaded.  Thus, overall, there would be a loss of 1 cycle per byte were I to implement this feature.
 
More importantly, this test is not reliable!  Imagine you hit RESTORE.  Several cycles later when the CPU actually tests CIA#2, it may have already generated a Timer interrupt.  Thus the RESTORE key would be lost.  Not a huge problem as the user could just keep banging on the RESTORE key until it was finally recognised.  But if I were to add 2 cycles and 2x bytes to my code, I would want it to work reliably.
 
I don't think there is any way (in software) to make this work reliably.  Please post if you can think of a method!  Thus I will stick with IRQs.
 
I did think of a method to improve bandwidth throttling (for lack of a better term).  The data rate is constant in the 1MHz portion of the screen, but in the 2MHz portion, the data rate is based on CPU usage: if the CPU is decoding a video frame, the data rate is "single" (1 byte/interrupt), but if the CPU is idle (only waiting for a packet to load) then the data rate is "multi" (2+ bytes/interrupt).  By the way, 2 bytes/interrupt is the max possible with C1581.  It might be 3/interrupt with other devices (not tested/confirmed...)
 
Currently the "throttle speed" is only updated once per VIC frame, that is, at 50 or 60 Hz depending on video standard.  But this means the CPU is "losing" time if it wants to decode a video frame and the loader is still in "multi" mode.  Conversely, there is a bandwidth "loss" if the CPU is idle and the loader is still in the "single" mode.
 
So imagine that half the time, either CPU processing or data bandwidth is sub-optimal.  Thus I'm thinking to move the "throttle control" code out of the interrupt and into the packet demuxer and video decoder.
 
I haven't done it as it requires a bit of a split of the code (from IRQ to demuxer,decoder)... the code is messy already!  But, I'm thinking it should provide a better improvement than switching from IRQs to NMIs (only 0.1 kByte/s improvement).
I'm kupo for kupo nuts!

Offline Hydrophilic

Re: Data Bandwidth Issues
« Reply #12 on: June 04, 2011, 10:10 AM »
I implemented data rate throttling outside the IRQ loader.  As expected, it did not affect the peak data rate.  I was hoping it would increase the average data rate.  Unfortunately it reduced the average data rate... grr... I'm thinking it's because the extra 2x-loading that was happening before outweighed the benefit of early 2x-loading with the new method.
 
So before, it would throttle the data rate exactly once per VIC field in the IRQ routine.  That is, the 1x and 2x codes were both in the IRQ.  The alternate method tested and reported above had the 1x and 2x both outside the IRQ.  The important thing is the symmetry -- both transition codes are either inside the IRQ or outside the IRQ.
 
So next I thought to take a non-symmetrical approach.  In this method, the mainline code will enable 2x loading as soon as it finishes decoding a video packet (assuming another one isn't already waiting).  But the switch to 1x loading only occurs in the IRQ, once per VIC field (if it notices the mainline code is processing video).
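
To make the non-symmetrical scheme concrete, here is a small Python model of the two transitions (a sketch only, not the real 6502 routines).  The class and method names are hypothetical; the behaviour modelled is just what the paragraph above says: mainline switches to 2x immediately, and only the once-per-field IRQ check switches back to 1x.

```python
class AsymmetricThrottle:
    """Hypothetical model: 2x is enabled by mainline, 1x only by the IRQ."""

    def __init__(self):
        self.multi = False     # False = 1x loading, True = 2x loading
        self.decoding = False  # True while mainline is processing video

    def mainline_finished_packet(self, another_packet_waiting):
        # Mainline path: enable 2x right away, unless another video
        # packet is already waiting to be decoded.
        self.decoding = another_packet_waiting
        if not another_packet_waiting:
            self.multi = True

    def mainline_start_decoding(self):
        # Mainline starts decoding; it does NOT touch the loading speed.
        self.decoding = True

    def irq_field_check(self):
        # IRQ path, run once per VIC field: drop to 1x only if the
        # mainline code is seen to be processing video.
        if self.decoding:
            self.multi = False


throttle = AsymmetricThrottle()
throttle.mainline_finished_packet(another_packet_waiting=False)  # 2x now
throttle.mainline_start_decoding()  # still 2x until the next field check
throttle.irq_field_check()          # IRQ notices decoding, back to 1x
```

In this model the loader can stay in 2x for up to one VIC field after decoding starts, which matches the observation that the method does not give CPU time back to the main thread.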
 
I tested this method too.  Again, it did not change the maximum data rate, but it did improve the average data rate a few percent.  Really hard to give an exact value, since it depends greatly on the video in question.
 
Although this non-symmetrical method does not improve available CPU time for the main thread like the symmetrical non-IRQ method, it increases the average data rate without decreasing CPU main thread time (as compared to the original, symmetrical in-IRQ method).  So I think I will stick with it.
 
The only drawback, compared to the original method, is more code bytes.  Originally, a single IRQ routine would test the main thread and set either 1x or 2x loading speed as appropriate.  But the non-symmetrical case has two different routines for throttle control (one in the IRQ and another in mainline code).  So if I run low on code space, I may revert to the original method.
 
That's about all the ideas I have...
 
Well another idea would be to go back to NMI for primary interrupts (that is, only for audio playback) and program CIA#1 to generate IRQs for fast-serial byte-received.  But because there is at least a 20 cycle overhead for each interrupt, this would probably make things slower... I'm not even going to try it!
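
A quick back-of-the-envelope check of that overhead, as a Python sketch.  The 20-cycle figure is from the paragraph above; the byte-received interrupt rate of 6660 per second is an assumption (it reuses the 6.66k byte/s estimate from the first post, one interrupt per byte), and the function names are hypothetical.

```python
def interrupt_overhead_cycles(interrupts_per_second, overhead_cycles=20):
    """Cycles per second spent purely on interrupt entry/exit overhead."""
    return interrupts_per_second * overhead_cycles


def overhead_fraction(cycles_per_second, cpu_clock_hz):
    """Share of the CPU consumed by that overhead at a given clock."""
    return cycles_per_second / cpu_clock_hz


# Assumed: one extra IRQ per received byte at about 6.66k byte/s.
extra_cycles = interrupt_overhead_cycles(6660)  # 133200 cycles per second
share_at_1mhz = overhead_fraction(extra_cycles, 1_000_000)
share_at_2mhz = overhead_fraction(extra_cycles, 2_000_000)
```

Under those assumptions the extra interrupts alone would eat roughly 13% of a nominal 1MHz CPU (about half that at 2MHz), before doing any useful work, which supports not bothering with the experiment.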
 
 
 

Offline airship

  • 128D user
  • *******
  • Posts: 1605
  • Age: 61
  • Location: Iowa, USA
  • Activity:
    0%
  • Country: us
  • Reputation: 113
  • Gender: Male
  • Former Editor, INFO Magazine
  • With us since: 28/07/2007
    • Atomic Airship
Re: Data Bandwidth Issues
« Reply #13 on: June 05, 2011, 02:40 AM »
HP-

Maybe you're running up against the limits of current physics. I suggest employing tachyons. :D

Seriously, you're learning a lot, and I really enjoy learning along with you, even though you haven't gotten the results you were hoping for. Yet!

Hang in there.
Serving up content-free posts on the Interwebs since 1983.
History of INFO Magazine

Offline Hydrophilic

Re: Data Bandwidth Issues
« Reply #14 on: June 05, 2011, 06:48 AM »
Thanks for the words of encouragement Airship!  You really are the idea man... I hadn't thought about tachyons!  So I'm thinking with anti-C1581 drive to cancel out the normal C1581, the data should arrive before it is ever transmitted... or at least arrive instantly :)
 
Seriously, these last few rounds of code changes have provided only minimal improvement in bandwidth while introducing extra costs.  So I tend to agree that the fast-serial interrupt-loader is near the limit.  Time for me to switch gears and work on the video encoder.
 

Offline RobertB

  • Forum god
  • ********
  • Posts: 2874
  • Location: Visalia, California
  • Activity:
    3.4%
  • Country: us
  • Reputation: 451
  • With us since: 05/06/2006
    • Fresno Commodore User Group
Re: Data Bandwidth Issues
« Reply #15 on: June 05, 2011, 07:04 AM »
So I'm thinking with anti-C1581 drive to cancel out the normal C1581...
     Unless controlled in a magnetic field, wouldn't that lead to a 1581/anti-1581 explosion?  ;)

          Truly,
          Robert Bernardo
          Fresno Commodore User Group
          http://videocam.net.au/fcug
          July 23-24 Commodore Vegas Expo v7 2011 - http://www.portcommodore.com/commvex

 


