From: john@acorn.co.uk (John Bowler)
Subject: Re: Multiprocessing Archimedes??
Date: 16 Aug 91 11:10:50 GMT

torq@GNU.AI.MIT.EDU (Andrew Mell) writes:
>I notice that the Arm3 has a new instruction over the Arm2 which is
>SWP. It swaps a byte or a word between register and external memory.
>(uninterruptible between the read and write)
  ^^^^^^^^^^^^^^^
Indeed, but not necessarily not interleavable with other memory operations
(sorry about the double negative :-). In particular, to fully support SWP
on a system with multiple memory bus masters, the memory control logic which
decides which bus master has access to the memory next would have to force
an interlock between the memory read and memory write of the SWP
instruction. Now, the ARM3 has a LOCK pin for this, but to support
multi-processors you need to connect it to something :-).

>All very interesting you might say, but it intrigues me as this sort
>of instruction is usually only used in multiprocessor systems as a
>software semaphore.
>
>Why did Acorn add this instruction to the Arm3?

Because a long time ago, when we were very young (;-) we tried to write a
multi-threaded OS (ARX) and we ``found'' (sic, thought) that it was spending
a lot of time going into supervisor mode and disabling interrupts so that it
could implement mutexes (for user mode code - including the OS, which ran in
user mode too). In theory SWP allows user code to implement mutexes
efficiently.

As far as I am concerned the MP aspects of SWP are bonuses (clearly these
were considered at the same time - or the LOCK pin wouldn't be there).
Notice that SWP always bypasses the cache; again this is MP support, however
there is an omission here in that it is impossible to do a (reliable) read
from external memory (you might get the cache contents instead!)

John Bowler (jbowler@acorn.co.uk)

From: john@acorn.co.uk (John Bowler)
Subject: Re: Multiprocessing Archimedes??
Date: 19 Aug 91 16:25:33 GMT

julian@bridge.welly.gen.nz writes:
>john@acorn.co.uk (John Bowler) writes:
>
>> Notice that SWP always bypasses the cache; again this is MP support, however
>> there is an omission here in that it is impossible to do a (reliable) read
>> from external memory (you might get the cache contents instead!)
>
>If you're using it to implement semaphores, this is not a problem, as you'd
>never need to access the semaphore with any instruction other than SWP.

Yes; there is no problem with the semaphore, but the semaphore must be
protecting some state which is shared. When a processor has claimed that
semaphore it probably needs to read the state and to obtain consistent
results when it reads it. If the data is in cacheable memory the only way it
can do that is to use sequences of the form:-

        SWP     rx, rx, [raddr]         ; read a value out
        STR     rx, [raddr]             ; and put it back... :-(

The alternative is to allocate shared data in uncacheable memory. This
requires some OS intervention (a user program cannot simply allocate
shareable data structures out of its own heap unless the whole heap is
uncacheable) and uncacheable data obviously has a performance hit.

>BTW. You wouldn't happen to know the instruction format for SWP, by any
>     chance? If a software emulator can be written for it for ARM2 machines
>     (like the FPE - or even add it to the FPE) then we can all start using
>     it.

RISC iX 1.2 emulates the SWP instruction on machines which do not support
it. RISC OS doesn't. The assembler syntax is:-

        SWP{cond}{B}    Rd, Rm, [Rn]

the semantics (except for the cache behaviour and so on) are:-

        MOV             <temp>, Rm
        LDR{cond}{B}    Rd, [Rn]
        STR{cond}{B}    <temp>, [Rn]

(ie the SWP Rx, Rx, [Raddr] example above *does* store the *old* Rx value in
[Raddr]... :-).
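
[Editorial illustration] The semantics just described can be modelled in a
few lines. This is only a sketch in Python, not ARM code: the register file
and memory are plain dictionaries, the names swp and try_lock are invented
for the example, and the convention "0 = free, 1 = claimed" for the mutex
word is an assumption rather than anything stated in the post. Cache
behaviour and the LOCK pin are not modelled at all.

```python
def swp(regs, mem, rd, rm, rn):
    """Model of SWP Rd, Rm, [Rn] for word swaps (no cache, no aborts)."""
    addr = regs[rn]
    temp = regs[rm]        # hold the source value aside first
    regs[rd] = mem[addr]   # old memory contents go to Rd
    mem[addr] = temp       # original Rm value goes to memory


def try_lock(mem, addr):
    """Attempt to claim a mutex word; True if it was free (assumed 0 = free)."""
    regs = {"r0": 1, "r1": addr}      # 1 = claimed (assumed convention)
    swp(regs, mem, "r0", "r0", "r1")  # swap 1 in, get the old value out
    return regs["r0"] == 0


memory = {0x8000: 0}
print(try_lock(memory, 0x8000))  # first claimant finds the word free
print(try_lock(memory, 0x8000))  # second claimant finds it already taken
```

Because the source value is held aside before the load, using the same
register for Rd and Rm still stores the old register value, which is the
point made in the parenthetical above.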
The instruction format is:-

bit 31                                                          bit 0
c.o.n.d.0.0.0.1  0.B.0.0.n.n.n.n  d.d.d.d.0.0.0.0  1.0.0.1.m.m.m.m

        c.o.n.d - the condition
        B       - 0 = swap word
                  1 = swap byte
        n.n.n.n - Rn
        d.d.d.d - Rd
        m.m.m.m - Rm

Data aborts (from the memory manager) leave Rd/Rm as they were before. SWP
bypasses the ARM3 cache, although the write operation still updates the
cache (if the address is cached). I don't know whether the read will cause
the rest of that part of the cache to be updated (I assume not, and the
programmer should not care :-)

John Bowler (jbowler@acorn.co.uk)

From: dseal@armltd.co.uk (David Seal)
Subject: Re: ARM3 instructions.
Date: 4 Sep 92 15:01:12 GMT

In article <4422@gos.ukc.ac.uk> amsh1@ukc.ac.uk (Brian May#2) writes:

>  I don't have an Archie myself but have used them quite a lot in the past.
>I was recently mucking about with a friend's A5000, trying to find the new
>instructions that turned the cache on and off. I found them, they were
>co-processor instructions with the processor itself as (I think) number 0.

Coprocessor 15, in fact.

>  Anyway, as I was disassembling away I found a new instruction (well, I had
>never come across it before). It was 'SWP' and I imagine it swaps registers
>with registers, maybe with memory as well? I can't remember. If it does
>reg<->mem as well, and is uninterruptable, perhaps it is for use as a
>semaphore in multi-processor systems?

The SWP instruction was new to the ARM2as macrocell. I believe ARM3 was the
first full chip which contained it. More recent macrocells and chips like
ARM6, ARM60, ARM600 and ARM610 also contain it.

It only swaps a register with a memory location (either a byte or a word),
and not two registers. It can however read the new contents of the memory
location from one register, and write the old contents of the memory
location to another register - i.e. it doesn't have to do a pure swap. This
may be the source of your idea that it can swap two registers.
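
[Editorial illustration] Going back to the instruction format John Bowler
posted earlier in the thread (cond, 0001 0B00, Rn, Rd, 0000 1001, Rm), the
fields can be packed into a 32-bit word as sketched below. This is Python
used as a calculator, not an assembler; the function name encode_swp and the
argument order are invented for the example, and the "always" condition
value 1110 is the standard ARM condition code rather than something given in
the posts.

```python
COND_AL = 0b1110  # standard ARM "always" condition (not stated in the thread)


def encode_swp(rd, rm, rn, byte=False, cond=COND_AL):
    """Pack SWP{cond}{B} Rd, Rm, [Rn] into its 32-bit instruction word."""
    for reg in (rd, rm, rn):
        if not 0 <= reg <= 15:
            raise ValueError("register numbers must be in the range 0..15")
    word = (cond & 0xF) << 28   # bits 31..28: condition
    word |= 0b00010 << 23       # bits 27..23: fixed 00010
    word |= int(byte) << 22     # bit 22: 0 = word swap, 1 = byte swap
    word |= rn << 16            # bits 19..16: base register
    word |= rd << 12            # bits 15..12: destination register
    word |= 0b00001001 << 4     # bits 11..4: fixed 00001001
    word |= rm                  # bits 3..0: source register
    return word


print("&%08X" % encode_swp(rd=0, rm=1, rn=2))             # SWP  R0, R1, [R2]
print("&%08X" % encode_swp(rd=0, rm=1, rn=2, byte=True))  # SWPB R0, R1, [R2]
```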
It is indeed uninterruptable, and yes, it is intended for semaphores.

>  Of course I won't be the first person to notice this so I wondered, could
>someone post some info on this, and also on the co-processor instructions
>relevant to the CPU itself?

The SWP instruction:

  Bits 31..28: Usual condition field
  Bits 27..23: 00010
  Bit 22:      0 for a word swap, 1 for a byte swap
  Bits 21..20: 00
  Bits 19..16: Base register (addresses the memory location involved)
  Bits 15..12: Destination register (where the old memory contents go)
  Bits 11..4:  00001001
  Bits 3..0:   Source register (where the new memory contents come from)

  Byte swaps use the bottom byte of the source and destination registers,
  and clear the top three bytes of the destination register.

  There are various rules about how R15 works in each register position,
  similar to those for LDR and STR instructions.

  The destination and source registers are allowed to be the same for a pure
  swap. I don't know offhand what would happen if the base register were
  equal to one or both of the others, but I don't think I'd recommend doing
  it!

  Assembler syntax is (using <> around optional sections):

    SWP<cond><B> Rdest,Rsrc,[Rbase]

The ARM3 cache control registers are all coprocessor 15 registers, accessed
by MRC and MCR instructions in non-user modes. (They will produce invalid
operation traps in user mode.)

Coprocessor 15 register 0 is read only and identifies the chip - e.g.:

  Bits 31..24: &41 - designer code for ARM Ltd.
  Bits 23..16: &56 - manufacturer code for VLSI Technology Inc.
  Bits 15..8:  &03 - identifies chip as an ARM3.
  Bits 7..0:   &00 - revision of chip.

Coprocessor 15 register 1 is simply a write-sensitive location - writing any
value to it flushes the cache.

Coprocessor 15 register 2: a miscellaneous control register.

  Bit 0 turns the cache on (if 1) or off (if 0).
  Bit 1 determines whether user mode and non-user modes use the same address
    mapping. Bit 1 is 1 if they do, 0 if they have separate address mappings.
    It should be 1 for use with MEMC.
  Bit 2 is 0 for normal operation, 1 for a special "monitor mode" in which
    the processor is always run at memory speed and all addresses and data
    are put on the external pins, even if the memory request was satisfied
    by the cache. This allows external hardware like a logic analyser to
    trace the program properly.

  Other bits are reserved for future expansion. Code which is trying to set
  the whole control register (e.g. at system initialisation time) should
  write these bits as zeros to ensure compatibility with any such future
  expansions. Code which is just trying to change one or two bits (e.g. turn
  the cache on or off) should read this register, modify the bits concerned
  and write it back: this ensures that it won't have unexpected side effects
  in the future like turning as-yet-undefined features off.

  This register is reset to all zeros when the ARM3 is reset.

Coprocessor 15 register 3: controls whether areas of memory are cacheable,
in 2 megabyte chunks. All accesses to an uncacheable area of memory go to
the real memory and not to the cache - this is a suitable setting e.g. for
areas containing memory-mapped IO, or for doubly mapped areas of memory.

  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are cacheable, 0 if
    they are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are cacheable, 0 if
    they are not.
   :                       :
   :                       :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are cacheable, 0 if
    they are not.

Coprocessor 15 register 4: controls whether areas of memory are updateable,
in 2 megabyte chunks. All write accesses to a non-updateable area of memory
go to the real memory only, not to the cache - this is a suitable setting
for areas of memory that contain ROMs, for instance, since you don't want
the cached values to be altered by an attempt to write to the ROM. (Or, as
in MEMC, by an attempt to write to write-only locations that share an
address with the read-only ROMs.)

  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are updateable, 0 if
    they are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are updateable, 0 if
    they are not.
   :                       :
   :                       :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are updateable, 0 if
    they are not.

Coprocessor 15 register 5: controls whether areas of memory are disruptive,
in 2 megabyte chunks. Any write access to a disruptive area of memory will
cause the cache to be flushed. This is a suitable setting for areas of
memory which, if written, could cause cache contents to become invalid in
some way. E.g. on MEMC, writing to the physically addressed memory at
addresses &2000000-&2FFFFFF will also usually change a virtually addressed
location's contents: if this location is in cache, a subsequent attempt to
read it would read the old value. To avoid this problem, the physically
addressed memory should be marked as disruptive in a MEMC system. Similarly,
any remapping of memory on a MEMC or other memory controller should act
disruptively, since the cache contents are liable to have become invalid.

  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are disruptive, 0 if
    they are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are disruptive, 0 if
    they are not.
   :                       :
   :                       :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are disruptive, 0 if
    they are not.

Coprocessor 15 registers 3-5 are in an undefined state after power-up: they
must be programmed correctly before the cache is turned on.

Note that you should check that the identity code in coprocessor 15
register 0 identifies the chip as an ARM3 before assuming that the other
registers can be used as stated above, unless you are absolutely certain
your code can only ever be run on an ARM3. Otherwise you are likely to run
into problems with other chips - e.g. an ARM600 uses the same coprocessor 15
registers to control its cache and MMU, but in a completely different way.
Just about the only thing they do have in common is that coprocessor 15
register 0 contains an identification code as described above.

David Seal
dseal@armltd.co.uk

All opinions are mine only...

From: mhardy@acorn.co.uk (Michael Hardy)
Subject: Re: Risc-OS Documentation
Date: 15 Aug 91 09:45:14 GMT
Organization: Acorn Computers Ltd, Cambridge, England

ARM3 SUPPORT
============

Introduction and Overview
=========================

The ARM3Support module provides commands to control the use of the ARM3
processor's cache, where one is fitted to a machine. The module will
immediately kill itself if you try to run it on a machine that only has an
ARM2 processor fitted.

Summary of facilities
---------------------

Two * Commands are provided: one to configure whether or not the cache is
enabled at a power-on or reset, and the other to independently turn the
cache on or off.

There is also a SWI to turn the cache on or off. A further SWI forces the
cache to be flushed. Finally, there is also a set of SWIs that control how
various areas of memory interact with the cache.

The default setup is such that all RISC OS programs should run unchanged
with the ARM3's cache enabled. Consequently, you are unlikely to need to use
the SWIs (beyond, possibly, turning the cache on or off).

Notes
-----

A few poorly-written programs may not work correctly with ARM3 processors,
because they make assumptions about processor timing or clock rates.

Finding out more
----------------

For more details of the ARM3 processor, see the Acorn RISC Machine family
Data Manual. VLSI Technology Inc. (1990) Prentice-Hall, Englewood Cliffs,
NJ, USA: ISBN 0-13-781618-9.

SWI Calls
=========

Cache_Control (SWI &280)
========================

Turns the cache on or off

On entry
--------
R0 = EOR mask
R1 = AND mask

On exit
-------
R0 = old state (0 => cacheing was disabled, 1 => cacheing was enabled)

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call turns the cache on or off. Bit 0 of the ARM3's control register 2
is altered by being masked with R1 and then exclusive ORd with R0: ie new
value = ((old value AND R1) XOR R0). Bit 1 of the control register is also
set, forcing the memory controller to use the same translation table for
both User and Supervisor Modes (as indeed the MEMC chip should). Other bits
of the control register are set to zero.

Related SWIs
------------
None

Related vectors
---------------
None


Cache_Cacheable (SWI &281)
==========================

Controls which areas of memory may be cached

On entry
--------
R0 = EOR mask
R1 = AND mask

On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are cacheable)

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call controls which areas of memory may be cached (ie are cacheable).
The ARM3's control register 3 is altered by being masked with R1 and then
exclusive ORd with R0: ie new value = ((old value AND R1) XOR R0). If bit n
of the control register is set, the 2MBytes starting at n*2MBytes are
cacheable.

The default value stored is &FC007FFF, so ROM, the RAM disc and logical
non-screen RAM are cacheable, but I/O space, physical memory and logical
screen memory are not. (You may find a value of &FC007CFF - which disables
cacheing the RAM disc - gives better performance.)
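
[Editorial illustration] A small sketch of the mask arithmetic may help
here. It is written in Python purely as a calculator (the SWI itself is
called from ARM code or BASIC), and the function name apply_masks is made up
for this example. It follows the stated rule, new value = ((old value AND
R1) XOR R0), and uses the two register 3 values quoted above; the AND mask
&FFFFFCFF is simply the one that turns the first value into the second.

```python
def apply_masks(old, eor_mask, and_mask):
    """new value = ((old value AND R1) XOR R0), kept to 32 bits."""
    return ((old & and_mask) ^ eor_mask) & 0xFFFFFFFF


DEFAULT_CACHEABLE = 0xFC007FFF  # default register 3 value quoted above

# Clear bits 8 and 9 only: R0 (EOR mask) = 0, R1 (AND mask) = &FFFFFCFF
new_value = apply_masks(DEFAULT_CACHEABLE, 0x00000000, 0xFFFFFCFF)
print("&%08X" % new_value)  # prints &FC007CFF

# Set bit 8 again regardless of its current state:
# clear it with the AND mask, then flip it on with the EOR mask.
restored = apply_masks(new_value, 0x00000100, 0xFFFFFEFF)
print("&%08X" % restored)   # prints &FC007DFF
```

By the same formula, R0 = 0 with R1 = &FFFFFFFF leaves the register
unchanged, so it should serve as a way of reading the old value back.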

Related SWIs
------------
Cache_Updateable (SWI &282), Cache_Disruptive (SWI &283)

Related vectors
---------------
None


Cache_Updateable (SWI &282)
===========================

Controls which areas of memory will be automatically updated in the cache

On entry
--------
R0 = EOR mask
R1 = AND mask

On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are updateable)

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call controls which areas of memory will be automatically updated in
the cache when the processor writes to that area (ie are updateable). The
ARM3's control register 4 is altered by being masked with R1 and then
exclusive ORd with R0: ie new value = ((old value AND R1) XOR R0). If bit n
of the control register is set, the 2MBytes starting at n*2MBytes are
updateable.

The default value stored is &00007FFF, so logical non-screen RAM is
updateable, but ROM/CAM/DAG, I/O space, physical memory and logical screen
memory are not.

Related SWIs
------------
Cache_Cacheable (SWI &281), Cache_Disruptive (SWI &283)

Related vectors
---------------
None


Cache_Disruptive (SWI &283)
===========================

Controls which areas of memory cause automatic flushing of the cache on a
write

On entry
--------
R0 = EOR mask
R1 = AND mask

On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are disruptive)

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call controls which areas of memory cause automatic flushing of the
cache when the processor writes to that area (ie are disruptive). The ARM3's
control register 5 is altered by being masked with R1 and then exclusive ORd
with R0: ie new value = ((old value AND R1) XOR R0).
If bit n of the control register is set, the 2MBytes starting at n*2MBytes
are disruptive.

The default value stored is &F0000000, so the CAM map is disruptive, but
ROM/DAG, I/O space, physical memory and logical memory are not. This causes
automatic flushing whenever MEMC's page mapping is altered, which allows
programs written for the ARM2 (including RISC OS itself) to run unaltered,
but at the expense of unnecessary flushing on page swaps.

Related SWIs
------------
Cache_Cacheable (SWI &281), Cache_Updateable (SWI &282)

Related vectors
---------------
None


Cache_Flush (SWI &284)
======================

Flushes the cache

On entry
--------
-

On exit
-------
-

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call flushes the cache by writing to the ARM3's control register 1.

Related SWIs
------------
None

Related vectors
---------------
None


* Commands
==========

*Cache
======

Turns the cache on or off, or gives the cache's current state

Syntax
------
*Cache [On|Off]

Parameters
----------
On or Off

Use
---
*Cache turns the cache on or off. With no parameter, it gives the cache's
current state.

Example
-------
*Cache Off

Related commands
----------------
*Configure Cache

Related SWIs
------------
Cache_Control (SWI &280)

Related vectors
---------------
None


*Configure Cache
================

Sets the configured cache state to be on or off

Syntax
------
*Configure Cache On|Off

Parameters
----------
On or Off

Use
---
*Configure Cache sets the configured cache state to be on or off.

Example
-------
*Configure Cache On

Related commands
----------------
*Cache

Related SWIs
------------
Cache_Control (SWI &280)

Related vectors
---------------
None

******************************************************************************

I hope this helps.

- Michael J Hardy        Email:     mhardy@acorn.co.uk
  Acorn Computers Ltd    Telephone: +44 223 214411
  Cambridge TechnoPark   Fax:       +44 223 214382
  645 Newmarket Road     Telex:     81152 ACNNMR G
  Cambridge CB5 8PB
  England                Disclaimer: All opinions are my own, not Acorn's

From: osmith@acorn.co.uk (Owen Smith)
Subject: Re: Risc-OS Documentation
Date: 13 Aug 91 15:06:19 GMT

The ARM3 SWIs really aren't all that interesting, and I've just totally
failed to find a documentation file for them. However, as a tester, here is
a bit of BASIC (courtesy of Brian Brunswick) which marks the RAM disk area
as not cacheable. This in fact makes it go faster.

SYS "Cache_Cacheable", 0, &fffffcff
SYS "Cache_Updateable", 0, &fffffcff

The reason it goes faster is that because such large amounts of data are
being slurped around, the memory copy loop tends to get flushed out of the
cache, particularly since it is a long piece of loop unrolled code (for
speed on an ARM2). So you end up with a cache full of data, very little of
which is ever accessed again before it gets flushed out of the cache by some
more data. The loop does an LDM and STM 10 registers at a time in RamFS, so
in theory there are two words that get cached (ARM3 reads 4 words at a
time), but this saving is swallowed up by the cache synchronisation delays.

You have to be careful though. Brian has his own re-sizing ram disk which
uses the system sprite area. Marking the system sprite area as not cacheable
makes it go slower. We (Brian and I) think this is because he uses the C
function memcpy(), in which the LDM and STM is 4 registers at a time. Since
this is a multiple of four, it hits the ARM bug where it loads 5 words and
then throws the fifth one away, which results in loading 8 words on an ARM3
(it always reads 4 word chunks even with the cache off). So with the cache
off, you load 8 then throw 4 away, load the next 8 (including the 4 you just
threw away) and throw 4 away etc. So you are effectively reading all the
data twice.
With the cache on this goes down to once. Yes the code will probably get
flushed out, but it is a tight loop (not unrolled) so it is not very likely
and the cost of reloading the code is less than the saving on the data
loads.

The moral of this is to be careful with the ARM3 SWIs, and don't just think
that it ought to go faster, do timings, in lots of different screen modes.

Owen.
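
[Editorial illustration] For anyone puzzling over the &fffffcff mask in the
BASIC snippet in Owen's post, the sketch below lists which 2 megabyte chunks
an AND mask switches off. It is Python rather than BASIC, the name
cleared_chunks is invented for the example, and it assumes only what the
thread states: one register bit per 2 megabyte chunk, 32 bits in all, with
the new value formed as ((old value AND R1) XOR R0).

```python
CHUNK_SIZE = 2 * 1024 * 1024  # each register bit covers a 2 megabyte chunk


def cleared_chunks(and_mask):
    """Return (first, last) address pairs for chunks whose bit is 0 in and_mask."""
    ranges = []
    for n in range(32):
        if not (and_mask >> n) & 1:
            start = n * CHUNK_SIZE
            ranges.append((start, start + CHUNK_SIZE - 1))
    return ranges


for first, last in cleared_chunks(0xFFFFFCFF):
    print("&%07X-&%07X" % (first, last))
# prints:
# &1000000-&11FFFFF
# &1200000-&13FFFFF
```

With an EOR mask of 0, every other bit keeps its old value, so only those
two chunks (bits 8 and 9) are affected by the two SYS calls.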