
%% Macintosh vs. MS-DOS Performance              By Sandberg and Hembree %%

[Editor's Note:  This was sent to me on the Internet, so I don't really
know who the authors are; I got their names from the article itself,
because I like to try to give credit whenever possible.  Anyway, while
this is a Mac vs. PC comparison, it's more of a Motorola vs. Intel
comparo, so it was suggested that it could apply to the Amiga.  It's very
informative, and very worth reading if you enjoy technical stuff!]

We have for some time seen claims made (primarily by MS-DOS sympathizers)
that the Apple Macintosh provides inferior performance when compared to
MS-DOS ISA/EISA/MCA machines.  The points made are usually like Jim Seymour's
claims that "On the price side of that equation, at every moment since its
introduction six years ago, the Mac has delivered less raw computing
performance at any given price level than a wide variety of comparable
MS-DOS machines" and "the raw power of 25 MHz and 33 MHz 386's and 486's
combined with the interprocess communication in OS/2, make a DOS machine a
far more powerful platform".  We can understand where a claim like this
comes from - there have been virtually no realistic MS-DOS vs Macintosh
benchmarks run!  Byte's benchmark suite comes with the disclaimer that you
cannot use it to compare the two machines and we have not been able to
find a reasonable 3rd party benchmark doing so. We will disprove such
claims then, not by use of extremely questionable benchmarks (one we found
had the Mac IIci 1,400 times faster than an unspecified 80386 machine,
clearly unreasonable), but by a careful architectural analysis of the two
computer families.  We have studied the two families for a considerable
time and present our data and conclusions below.  One of us (Sandberg)
holds both a BSEE and an MSEE in computer engineering and has designed
high-performance image-processing products for the ISA bus, as well as
several software projects for the Macintosh.  The other (Hembree) holds a
BS in Computer Science with 91 hours of graduate study divided between CS,
EE (digital and IC design) and Math, as well as hardware and software
product development experience on both Mac and ISA systems.

Processor Family Architectures

Let's begin with the fundamentals of the machines.  Let's go inside
the architecture of the Mac and MS-DOS machines and see if these claims
can be derived from this most fundamental level.  In particular, we will
begin with the beginning, the Motorola 680x0 (68k) and the Intel 80x86/88
('86) CPU's.  We will consider primarily those features of the processors
available to programmers on the generalized Mac and MS-DOS platforms, not
features which require hardware mods or unpopular alternate operating
systems such as A/UX or OS/2.


The first key to evaluating the potential power of a processor
is its registers.  The 68k family all have 8 address and 8 data registers
of 32 bits each.  The 68k family actually has two or three stack pointer
registers but this feature is not used in the Mac and is not included in
this analysis.  The '86 family has only 8 registers, 16 bits in the '286
and earlier and 32 bits in the later versions, all usable as data registers
and some as addressing registers.  Both families have additional registers
for status, PC, and control registers for special features (cache, memory
management, etc.) which are not used except in OS programming (discussed
below).  All 68k CPU's therefore have (8 + 8) x 32 = 512 bits of user
registers while the '86ers have either 8 x 16 = 128 bits ('286 and earlier)
or 8 x 32 = 256 bits of user registers.  We leave off discussion of the
'86 family's so-called segment registers for later. 

Instruction Sets

The instruction sets of the two machines both cover all of the standard
operations but with differing emphasis.  The 68k family adds bit
manipulation instructions and, in the '020 and later processors, bit field
instructions.  There are no comparable instructions in the '86 family,
which requires sequences of mask and shift instructions to do the same
tasks (these tasks are important in both graphics and in many efficient
data structures).  The '86 family supports both packed and unpacked BCD
operations including multiply and divide (but only using the accumulator
register AL) while the 68k family allows only packed BCD addition and
subtraction, but supports direct memory-to-memory operations with limited
addressing options.  This feature is generally of interest only to COBOL
compiler writers - and may help explain why COBOL compilers for the Mac
are extinct.  No 68k instructions except subroutine call and return (which
implicitly use the A7 register as the stack pointer) require specific
registers - i.e., if a count or bit number register is called for, any of
the 8 data registers can be used.

One feature of the '86 instruction set is the ability to override a few of
the implicit parameters by using prefix bytes.  The two most common uses
of prefix bytes are to override default segment registers (this is the
only way, except in string instructions, to use the ES segment register)
and to cause the "string" group instructions to repeat either
unconditionally or conditionally.  The third type of use is available only
on '386 and '486 CPU's and specifies 32-bit data and 32-bit addressing
offsets are to be used instead of the normal 16-bit data and 16-bit
offsets.  Thus, a '386 MOVS instruction could have 4 prefix bytes,
overriding the source segment register, selecting 32-bit operand size,
32-bit SI and DI register offsets and a repeat prefix.  A longstanding
problem with, in particular, the '86 string instructions has been that the
CPU, necessarily, allowed interrupts in the middle of long string moves
and compares and did not save the complete processor state.  In the '286
and earlier processors (we have not checked for the later ones) Intel's
manuals warn that only the last prefix byte in a multi-prefixed
instruction is saved during an interrupt and that this can cause improper
operation under some common circumstances.

In addition to being usable as generalized data holders, most '86
"general-purpose" registers are implicit parameters in a variety of
instructions such as variable-amount shifts (the CL register), and every
multiply or divide (which use the AX register and also the DX for the
largest operands).  As an example, the 68k family has a generalized (all
sizes and addressing modes allowed) memory-memory move instruction while
the '86 family uses "string" instructions.  The fact that 7 of the 8
general-purpose registers are also implicit parameters in various
instructions in the '86 family places compiler writers in a particular
bind.  They must choose between not using any registers (absolutely
destroying performance), trading off registers for instruction use (e.g.,
if the translate instruction is not used, the BX register becomes
generally available) or putting register save and restore instructions
around the instructions making use of the implicit registers.  This last
option is the one most often used but can be a fairly tricky one.  The
compiler must evaluate each individual occasion to determine whether the
overhead of setting up for the special instruction exceeds the execution
time of multi-instruction equivalent code.  Although the compiler can
determine which registers actually need saving, and hence the overhead
associated with the save-restore template, the time trade-offs are often
dependent on the repeat count and, if the count is a variable, the
compiler cannot determine which method is optimal and must make an
arbitrary choice (one which may never be optimal in a given use). 

Memory Models and Accesses

Another important aspect of processor power is how the CPU accesses data
(how easy is it to describe where the data is and get it).  This area
includes the processor's addressing modes, memory model, physical memory
size, and memory access speed.  Remember that we are not considering
special 680x0 or 80x86 features for x > 0.

For addressing modes, both processor families offer register direct,
immediate, indirect, offset, dual-register with offset and direct (or
absolute).  Where registers are used as addresses or hold parts of a
computed address, the comparison becomes much more complex.  The '86
family allows any of 4 registers to be used in indirect addressing, the
68k any of 8.  For offset addressing, the 68k family allows a 16 bit
offset from any of 8 address registers, the '86's allow either 8 or 16 bit
offsets from any of 4 registers.  In dual-register with offset addressing,
the '86 family allows either of 2 registers to be added to either of 2
other registers (giving a total of 4 combinations) and either an 8 or a 16
bit offset while the 68k family allows any of 8 address registers (or the
program counter, in the case of a source operand) to be added to any of
the other 15 registers (considered as either 16 or 32 bit signed values,
giving a total of 240 or 270 possible combinations) plus an 8 bit offset.
On the Mac, absolute addressing can only be used to access a limited pool
of shared system variables; every other part of the Mac system must be
position independent and may be located anywhere in memory.

The 68k family also has predecrement and postincrement modes, which use
any of the address registers.  The only similar usage in the '86 family is
found in implicit addressing modes in the string and stack operate
instruction groups.  In the string group, the SI and DI registers are
implicit and in the stack group, the stack pointer register is implicit.
In short, we see that the '86 family's addressing modes are a proper
subset of the 68k family's modes and that the '86 family allows use of
only half as many registers in the modes which use registers.

As for memory models, the 68k family uses a simple large linear address
space which is broken on the Mac into RAM, ROM, and I/O devices (with
minor complications for NuBus).  The '86 family uses two address spaces,
I/O and memory.  I/O is a single 64k address space, accessed only thru
special instructions.  Memory is accessed as a series of 64k segments,
requiring segment registers to specify which segment is currently being
accessed.  It is these segment registers which cause larger programs
difficulty.  Different registers used in addressing use different default
segment registers, or these may be overridden with prefix bytes.  In the
general case, though, this means that access to an arbitrary memory
location requires that a segment register first be loaded, then the access
performed.  The property of locality may reduce the number of segment
register loads needed, but often at a cost in compiler complexity and/or
run-time checking overhead.  There is no way of completely avoiding the
fundamental problems in a memory model which must always use one of
several auxiliary registers to determine a physical address.

Another factor in the evaluation of the benefits of addressing modes is
the cost in time of using a particular addressing mode, measured in clock
cycles.  Here, the individual members of the two families differ both in
basic access time and in the number of clock cycles needed to compute a
particular address (register direct imposes no access penalty on these
processors).

The 8086/88 CPU's use a physical access cycle of 4 clock cycles, which
drops to 2 cycles in the latest family members.  A perusal of Intel
manuals shows that each memory data reference for most data manipulation
instructions adds either 6 clock cycles (for a source operand) or 13 clock
cycles (for a destination operand) plus an additional effective address
(EA) calculation time.  This EA calculation time is from 5 to 12
additional clock cycles, with two more clock cycles needed if the default
segment register is overridden.  Thus, an add of a byte or word register
to memory takes 16 + (5 to 12) + (0 or 2) clock cycles or 21 to 30 clock
cycles to execute, if the instruction has been prefetched (else add 4 or 8
clock cycles).  Later members of the family drop this down to a minimum of
7 clock cycles, a very substantial improvement.  This figure does,
however, assume the instruction has been prefetched with no segment
register override done.  Clock speeds in the '86 family range up to 33 MHz
for the fastest 80386 parts.  Caches, for both instructions and data,
appear only in the latest family member, the 80486.

The 68000 used 4 clock cycle reads and 5 clock cycle writes, also dropping
to 2 cycles (for both read and write) in the latest family members.
Motorola manuals state that, for the same class of instructions as in the
Intel example above, a 68000 takes 4 clock cycles (memory source operand)
or 8 clock cycles (memory destination operand) plus an EA calculation time
of from 4 to 10 clock cycles.  Thus, the 68000 takes 8 + (4 to 10) or 12
to 18 clock cycles to add a byte or word register to memory.  Later
members of the 68k family also improve the clock cycle performance on this
instruction, down to an optimal 5 to 7 clock cycles (differing according
to EA calculation times).  This optimal case assumes only an instruction
cache hit, and would be faster in the case of a 68030/040 data cache hit.
Clock speeds for the 68k processor family range up to 50 MHz for the
fastest 68030 parts.  Instruction and data caches are included in the
latest two generations of CPU's, the 68030 and 68040.

In summary, the Motorola processors are superior to the Intel processors
in terms of instruction set, addressing modes, memory model, execution
clock-cycle timing, and fastest clock speeds.  In no sense, then, is the
performance of the Intel CPUs up to that of their Motorola counterparts.
The fact that high-performance workstation designers have consistently
chosen 68k family CPU's rather than '86 family CPU's may be taken as
confirmation of this evaluation.  Is there, then, a system-level
implementation difference to account for the claimed Mac performance
deficit?

System-Level Hardware


The Macintosh family is simpler to analyze, since all of the systems are
manufactured by Apple Computer, Inc.  The Mac Plus and SE use 68000 CPU's
running at just under 8 MHz with zero wait-state memory.  Their SCSI
(high-speed peripheral) ports operate at 350 and 700 kilobytes per second,
respectively.  The Mac II uses a 68020 CPU with 68881 floating point
coprocessor (FPU), running at about 15.8 MHz, with two wait-state memory.
The SCSI port operates at a maximum data transfer speed of about 1.2
megabytes per second.  Accesses to NuBus boards take about 800 to 1000
nanoseconds (ns) for boards that require two NuBus wait cycles (200 ns).
The Mac IIx, IIcx, and SE/30 each use a 68030 CPU with 68882 FPU at the
clock speed of the Mac II with the same memory and SCSI speeds and, for
the IIx and IIcx, the same NuBus access speeds, with the SE/30's speed of
access to its direct slot being dependent only on the speed of the add-in
card.  The Mac IIci runs a 68030 CPU at 25 MHz, with support for 80 ns
DRAMs, a slot for an external cache memory card, support for burst-fill
mode, and somewhat faster NuBus cycle times.  For maximum performance,
Macintoshes must be upgraded with processor accelerators.  The highest
performance of these is the Daystar 50 MHz 68030 accelerator.  This
replaces the CPU in a Mac II family system and adds 32K of zero wait-state
burst-fill cache memory.  Additionally, a private high-performance bus can
connect the Daystar accelerator to video cards, SCSI cards, and memory
expansion cards.


The IBM/Compaq MS-DOS systems currently being sold use
8086/286/386/386SX/486 processors.  The processor speeds range from 8 MHz
to 33 MHz, with zero or 32k bytes of built-in cache and no cache
expandability.  Memory speeds run from zero to three wait states,
generally with more wait states on the faster processors.  ISA bus speeds
run from 1.2 to 8 million bytes per second (although this latter figure
appears to be for DMA operations only).  EISA systems are reported to have
twice the data bandwidth, again for DMA primarily and slower for random
memory accesses.  MCA systems from IBM have high burst data rates, near
the best performance of EISA, but degrade even more rapidly in random
access operations.  All systems support DMA and can use disk controllers
whose performance is primarily limited by the disk head data rate.
Caching disk controllers in some models also greatly enhance disk
performance.

In general, top-end MS-DOS systems are engineered to be CPU-limited, not
memory-speed limited as the top-end Mac's are.  Unfortunately for the
MS-DOS folks, their CPU's are much less capable than the Mac's, to the
point that we have seen a 20 MHz (no wait state) 386 system outperformed
four to one on some non-floating-point tasks by a Mac Plus using
comparable quality commercial programs for each system.  A baseline IBM AT
(8 MHz) took over ten times as long as the Plus for this graphics test.
We believe that most of the claimed drawing-speed advantages of MS-DOS
systems come from (expensive) special-purpose graphics co-processor
boards, usable only from a few programs such as Autocad.  In price, too,
the high-end IBM and Compaq machines are substantially more expensive than
the top-end Macintoshes, actually by enough to more than pay for a
top-of-the-line 50 MHz 68030 accelerator for the Mac.

Software Issues

Therefore, if the MS-DOS world's claims of superior performance are
correct, the Mac software, at the system or application level or both,
must be terribly flawed.  As we have already seen, the 68k family
provides a large number
of addressing modes which, used in conjunction with the large register
set, allows the use of complex data structures without imposing a
performance penalty.  Because of this, the Macintosh toolbox has from the
start been designed according to good object-oriented programming (OOP)
principles.  The best example of this is the Dialog Manager, which is
clearly two subclasses (dialogs and alerts) of the Window Manager, which
is in turn a subclass of QuickDraw.  Now that language support for OOP is
available, the toolbox fits even more cleanly into applications.  At the
OS level, the Macintosh Hierarchical File System (HFS) is clearly superior
to both DOS and OS/2, even the so-called "High-Performance File System."
This superiority is in terms of both functionality and efficiency.  HFS is
simply faster, seldom requiring more than two disk accesses to read or
write a file block (and usually just one).  Subdirectories were implicit
in the original Macintosh File System (MFS), even though it was a flat
file system.  This meant that users (and developers) did not have to throw
out existing programs, or learn arcane CONFIG.SYS pathing protocols, to
use subdirectories.

This leads to another area where the Mac has a decided edge over the
MS-DOS world.  Apple sets the standards and developers follow them.
Those who don't follow them die in the marketplace when their products
break.  Because
there is a single graphics standard on the Mac (QuickDraw) which is
device independent, everyone goes thru QuickDraw to image to both the
screen and printers.  With the exception of a dwindling number of game
programmers, Mac developers are a remarkably well-behaved bunch.  We have
determined by measurement that there are few performance gains to be had
by bypassing the system and toolbox calls.  This stands to reason since
Apple spends roughly $400 million a year on research and development.
This is a large enough amount that it would be surprising if the system
and toolbox routines and data structures weren't as close to optimal
performers as can reasonably be designed.  Although bad application design
can ruin the performance of any system, the top performers in the Mac
world are truly stellar.  Such applications as WingZ, WriteNow, and Think
C 4.0 are examples of how fast programs can run on the Macintosh.

In the MS-DOS world, Microsoft sets the standards, and the developers
"work around" them.  Bypassing DOS, or even the BIOS, is routine for the
rocket scientists of the MS-DOS developer community, in the name of
performance.  The reason for this is that MS-DOS began life as QDOS, a
"Quick and Dirty" clone of the CP/M operating system for 8086 systems.
What this means in reality is that MS-DOS began life as a collection of
routines and data structures, "flying in loose formation," rather than as
a carefully integrated system.  Microsoft has had to work long and hard
just to get MS-DOS cleaned up and give it a rough approximation of the Mac
OS (but not toolbox) capabilities.  In the process, Microsoft broke most
existing applications.  Few DOS 1.x (or even 2.x) applications will run in
DOS 3.3 if they took the normal and expected shortcuts of most
high-performance commercial applications.  The only reasons for an MS-DOS
application to beat a comparable Macintosh application are in floating
point, if the Macintosh being evaluated does not have an FPU, or in a
text-based interface, where the MS-DOS application just writes a byte per
character but the Mac has to draw the character (a much more complex
task).  With Microsoft now proclaiming - belatedly but correctly - that
GUIs (Graphical User Interfaces) are the wave of the future, even this
(hardware-based) advantage will be lost by the MS-DOS world.  And after
all, graphing and drawing programs simply can't use simple text-only
display hardware.


In conclusion, without resorting to dubious benchmarks, we see that the
Mac architecture enjoys a fundamental advantage over the MS-DOS
architecture at all levels from basic hardware thru system software and
application software.  We are quite unable to understand the various
columnists' insistence that comparing generationally equivalent Mac and
MS-DOS machines always favors the MS-DOS machines when, as we have shown,
the truth is otherwise.  Indeed, since the current generation of Macintosh
hardware offerings do not wring the maximum performance out of the
Motorola 680x0 CPU's, we can look forward to further improvements in Mac
performance.  MS-DOS machines are already pushed so near to the limits of
their hardware performance potential that only enhanced CPU's will improve
their relatively laggardly performance.

We are of the opinion that most or all of the so-called benchmarks which
are bandied about are inadequate.  A reasonable benchmark would operate on
hundreds of thousands of bytes of data, and would place more emphasis on
integer than floating point calculations, but would perform both.
Further, the benchmark should not be concerned with any sort of I/O (which
can always be sped up by throwing faster hardware into the box).  For a
raw system performance rating, the benchmark should be coded in assembler,
not C or some other higher level language (HLL).  This is because such HLL
benchmarks are usually more a measure of the quality of the compiler's
code generator.  Common code benchmarks, where the same HLL code is
executed on all of the systems being tested, are particularly susceptible
to another flaw, biased coding.  By biased coding, we mean writing the
code such that the compilers would fully exploit one processor's
architecture but not another's.  An example of this would be not using
register variables in C, or using only 1 or 2.  Such code would use
everything an MS-DOS CPU has but leave most of the registers of a 680x0
idle.
This is what we have observed with the Byte benchmarks.  Indeed, Byte went
further and "ported" a tiny-C compiler from a Z-80 to both systems.  Since
the Z-80 register set maps directly onto the 80x86 register set, this is a
fairly optimal fit and uses most of the resources of the 80x86 CPU, while
keeping everything in a single segment to avoid the 64 kbyte segment
limits of MS-DOS.  The 680x0 version of their tiny-C, on the other hand,
uses less than half the CPU's registers, and those inefficiently (i.e.,
only the data registers and ignoring most of the addressing modes of the
68000).  If Byte were serious about its benchmarks, it would declare
everything registerized and pointerize the array accesses (like any good
commercial developer), then compare its brain-dead-C results to MPW C,
THINK C, and appropriate C compilers from the MS-DOS world.  We are of the
opinion that the results would be very interesting, but embarrassing to
Byte.