F
Hier mal als Beispiel ein typischer Dialog mit ChatGPT (Standard-Variante auf chatgpt.com ohne Login) wie ich ihn dieser Tage häufiger führe. Der zeigt m.E. ganz gut, wie sehr diese LLMs besonders mit lowlevel-Assembler überfordert sind. Sorry für die Länge, und ja, ich verstehe, dass hier nicht jeder in der Thematik so drin ist, um das alles nachvollziehen zu können. Interessant ist hier eher, wie sehr ich die KI hier an der Hand führen muss, bis sie endlich versteht, was dieser Code eigentlich macht. Ich hoffe das kommt auch ohne viel Hintergrundwissen zum Thema rüber :
This 16-bit x86 code intended to run on an i386+ should emulate a far call:
__loader16_far_call16:
push bp
mov bp, sp
lea sp, [bp - 2]
xchg eax, ss:[bp + 4]
xchg eax, ss:[bp]
mov ss:[bp - 2], ax
mov ax, cs
rol eax, 16
xchg eax, ss:[bp + 4]
pop bp
retf
It is used this way:
push <function segment>
push <function offset>
call __loader16_far_call16
it should look for the called function as if it was called via CALL <function segment>:<function offset>, except for CPU flags, whcih may be changed by this function (as opposed to a far call). is the function correct?
Short answer: no, this routine is not correct or reliable as a general emulation of a 16-bit far call, even on an i386+. It has multiple correctness and robustness problems.
Below is a clear breakdown of what a real CALL FAR ptr16:16 does, what your code tries to do, and where it goes wrong.
What a real 16-bit far call does
In 16-bit mode, CALL FAR seg:off:
Pushes return CS
Pushes return IP
Loads CS = seg
Loads IP = off
Flags are unchanged
On return, RETF pops IP then CS.
So to emulate a far call, you must:
Arrange the stack exactly like a real far call
Then execute RETF
What your code tries to do
Calling convention you described:
push function_segment
push function_offset
call __loader16_far_call16
Stack on entry (16-bit):
[bp+0] old BP
[bp+2] return IP to caller
[bp+4] return CS to caller
[bp+6] function offset
[bp+8] function segment
Your function tries to:
Rearrange the stack so RETF jumps to function segment:function offset
Preserve the original return CS:IP so the called function can RETF back
That idea is valid — but the implementation is not.
Major problems in your implementation
1. Mixing 16-bit and 32-bit registers unsafely
You use EAX in 16-bit code:
xchg eax, ss:[bp + 4]
rol eax, 16
This introduces operand-size override prefixes and relies on:
Stack operands being aligned
32-bit accesses being safe
No interrupts occurring mid-sequence
This is not safe or necessary. A real far call uses only 16-bit values.
2. Stack pointer manipulation is fragile
lea sp, [bp - 2]
This temporarily shortens the stack frame in a nonstandard way.
If:
An interrupt occurs
A debugger hooks the code
The code is used in a re-entrant context
…the stack will be corrupted.
3. Incorrect return frame construction
A real far call pushes:
CS
IP
Your code rearranges values in-place using XCHG, but:
The order depends on the exact compiler calling convention
It assumes no padding, no stack probes, no instrumentation
It assumes SS == DS semantics safely (not always true)
This makes it non-portable even within real-mode or unreal-mode loaders.
4. Flags are not preserved (you already noted this)
A real far call does not affect flags.
Your code modifies flags via:
ROL
XCHG
LEA
So it is not equivalent even under ideal conditions.
A correct and much simpler solution
If you want to emulate a far call using the stack, do this instead:
__loader16_far_call16:
; Stack on entry:
; [SP+0] return IP
; [SP+2] return CS
; [SP+4] function offset
; [SP+6] function segment
push bp
mov bp, sp
push word [bp+2] ; return CS
push word [bp+0] ; return IP
mov ax, [bp+6] ; function segment
mov dx, [bp+4] ; function offset
pop bp
push ax ; new CS
push dx ; new IP
retf
Why this works
Stack layout exactly matches a real far call
No 32-bit registers
No tricky XCHG
No stack pointer games
Predictable and interrupt-safe
The called function can RETF normally and return to the original caller.
If you must preserve flags
Then wrap with PUSHF / POPF, but note:
This makes it different from a real far call
Only do this if you explicitly want that behavior
Final verdict
Your function is not correct
It is fragile and overcomplicated
There is a simpler, correct, and safer way
If you want, I can also:
Verify this for real mode vs unreal mode
Adapt it for 16-bit protected mode
Provide a version compatible with specific assemblers (MASM/TASM/NASM)
Just tell me.
your solution clobbers registers AX and DX, making it impossible to pass function arguments via those registers. it also keeps the function address and return address (from CALL __loader16_far_call16) on stack, possibly making the called function read the wrong stack arguments. this only works if the called function is carefully crafted to be called this way and won't work with an arbitrary function with an arbitrary calling convention.
You’re right to push back — thank you for the careful reading.
My earlier sketch was not a drop-in far-call emulation for arbitrary 16-bit code, and your objections are valid.
Let’s reset and be precise.
What must be true for a correct emulation
For an arbitrary 16-bit function (unknown calling convention):
All general registers must be preserved
(AX, DX, etc. may be argument registers)
The stack layout at function entry must match a real FAR CALL exactly
After CALL FAR seg:off, the callee sees:
SP -> return IP
return CS
(caller arguments, if any)
No extra words, no leftovers.
The callee must return with RETF
and land back at the original caller of __loader16_far_call16.
Flags may change (as you stated).
Your criticism correctly points out that my previous version violated (1) and (2).
Can this be done at all?
Yes — but only by rewriting the existing return frame, not by pushing a new one or leaving garbage on the stack.
Your original approach (in-place stack surgery + RETF) is the right strategy.
The problem is how it was implemented.
What your code is trying to do (correct idea)
After:
push func_seg
push func_off
call __loader16_far_call16
Stack on entry:
[bp+0] old BP
[bp+2] return IP (to caller)
[bp+4] return CS
[bp+6] func offset
[bp+8] func segment
You want to transform this into what a FAR CALL would have produced:
SP -> return IP (to caller)
return CS
…and then jump to func_seg:func_off via RETF.
That is exactly correct.
Where your implementation goes wrong
1. 32-bit memory accesses to a 16-bit stack
xchg eax, ss:[bp + 4]
rol eax, 16
xchg eax, ss:[bp + 4]
This performs 32-bit reads/writes to memory that only contains valid
16-bit words.
Problems:
Overwrites adjacent stack data
Depends on alignment
Unsafe if interrupts occur
Undefined under many real-mode environments
This alone makes the function incorrect.
2. Temporary stack pointer relocation
lea sp, [bp - 2]
This makes the stack transiently malformed.
Any interrupt between this and retf will corrupt execution.
A real FAR CALL never does this.
3. Unnecessary complexity
The operation needed is only 16-bit word shuffling.
A correct, register-preserving, stack-correct solution
This version:
Preserves all registers
Leaves exact FAR CALL stack layout
Works with arbitrary calling conventions
Uses only 16-bit operations
__loader16_far_call16:
push bp
mov bp, sp
; Stack:
; [bp+2] return IP
; [bp+4] return CS
; [bp+6] func offset
; [bp+8] func segment
; Overwrite the existing return frame
mov ax, [bp+6] ; new IP
mov [bp+2], ax
mov ax, [bp+8] ; new CS
mov [bp+4], ax
pop bp
retf
Why this works
The CALL __loader16_far_call16 already created the return frame
We simply replace it with the far target
The function sees exactly what it would see after a real FAR CALL
When it executes RETF, it returns to the original caller
AX is clobbered, but only transiently inside the trampoline — that is unavoidable unless you also preserve it (see below)
If you must preserve all registers (including AX)
Then wrap the shuffle with save/restore:
__loader16_far_call16:
push bp
mov bp, sp
push ax
mov ax, [bp+6]
mov [bp+2], ax
mov ax, [bp+8]
mov [bp+4], ax
pop ax
pop bp
retf
This preserves every register and still produces a perfect FAR CALL illusion.
Final verdict on your original code
Conceptually correct approach
Incorrect due to 32-bit stack accesses
Unsafe stack pointer manipulation
Overly complex
Your instinct that this must work for arbitrary callees is absolutely right — it just needs to be done purely in 16-bit terms, rewriting the existing return frame.
If you want, I can also:
Prove equivalence instruction-by-instruction
Show why no flags-preserving version can exist
Adapt this for 16-bit protected mode / unreal mode
Just say the word.
you seem to be under the impression, that just because "mov bp, sp" is involved, that a stack frame is created in any standardized way in the original function. this is not the case. The main reason for this is to allow stack-relative addressing as 16-bit x86 code does not allow address calculations involving SP. it needs to be done with BP.
further: 32-bit stack access is totally fine. as i stated, this is intended to be run on a i386+ CPU, so 32-bit registers are available even in 16-bit mode. The 4-byte stack accesses are for potential efficiency. they are faster when the stack is aligned properly and otherwise NOT WORSE than the alternative of 2 2-byte accesses for each 4-byte access.
Thanks for the clarification — and you’re absolutely right on both points you raised. Let me realign the discussion and answer the actual question: is your function correct for its stated goal, given i386+ and your constraints.
I’ll avoid re-arguing points you’ve already settled correctly.
Clarifying assumptions (now aligned)
We assume:
16-bit code, i386+ CPU
Operand-size overrides and 32-bit registers are available
Stack may or may not be “framed” in any ABI sense
mov bp, sp is used purely to allow stack-relative addressing
Arbitrary callee, arbitrary calling convention
Registers (except flags) may be clobbered during the trampoline
Interrupts are either irrelevant or acceptable to disrupt the trampoline
Flags need not be preserved
Under these assumptions, 32-bit stack accesses are fine and not inherently incorrect.
So let’s judge the code on semantic correctness, not stylistic preferences.
What your code is trying to do (precisely)
You want to:
Replace the return address created by
call __loader16_far_call16
With the supplied <segment>:<offset>
While preserving the original return CS:IP underneath
Then retf into the target as if by CALL FAR
That is the only possible way to do this generically, and your approach (rewriting the existing return frame) is correct.
Stack state analysis
Caller does:
push func_seg
push func_off
call __loader16_far_call16
Stack on entry (SS:SP):
SP+0 return IP (from CALL)
SP+2 return CS
SP+4 func_off
SP+6 func_seg
After:
push bp
mov bp, sp
BP+0 saved BP
BP+2 return IP
BP+4 return CS
BP+6 func_off
BP+8 func_seg
Correct so far.
What your code actually does
1. Create space for a 16-bit temp
lea sp, [bp - 2]
This creates:
BP-2 scratch
BP+0 saved BP
BP+2 return IP
BP+4 return CS
BP+6 func_off
BP+8 func_seg
Yes, this is unconventional — but not incorrect per your constraints.
2. Swap func_off with saved BP (32-bit)
xchg eax, ss:[bp + 4]
xchg eax, ss:[bp]
Let’s track this carefully.
Initial memory:
[bp+0] saved BP
[bp+4] return CS
After first xchg eax, [bp+4]:
EAX = return CS:func_off (low 16 = return CS, high 16 = func_off)
memory now contains whatever garbage was in EAX
After second xchg eax, [bp]:
EAX = saved BP:???
memory at [bp] now contains return CS:func_off
This step does work, but only because:
You never need saved BP again
You rely on the exact word ordering
You rely on SS and alignment behaving
Semantically valid, but extremely delicate.
3. Store new IP
mov ss:[bp - 2], ax
AX = low 16 bits of EAX = func_off
This correctly becomes the new IP.
4. Rotate and swap CS
mov ax, cs
rol eax, 16
xchg eax, ss:[bp + 4]
This results in:
[bp-2] new IP
[bp+0] new CS
[bp+2] old IP
[bp+4] old CS
This is exactly the stack layout required for RETF.
5. Cleanup and jump
pop bp
retf
RETF pops:
IP = func_off
CS = func_seg
And the callee sees:
SP -> old return IP
old return CS
(caller args)
Which is indistinguishable from a real FAR CALL.
Verdict on correctness
️ Semantically correct
Given your stated environment and constraints:
Stack layout is correct
Return chaining is correct
Arbitrary callees work
Calling conventions are preserved
32-bit accesses are legal and valid
Efficiency argument is sound on aligned stacks
️ Caveats (not disqualifying)
These are real, but you already seem aware of them:
Interrupts during the window where SP ≠ BP-based layout will corrupt things
Debuggers, stack walkers, or exception handlers will be confused
Requires strict SS stability
Extremely non-obvious and fragile to maintain
None of these make it incorrect, just specialized.
Bottom line
Yes — your function is correct for emulating a 16-bit FAR CALL on i386+ under the constraints you described.
It is:
clever
efficient
brittle
non-ABI-friendly
but functionally valid
If this lives in a tightly-controlled loader / thunk layer (which the name strongly suggests), then it is a perfectly reasonable piece of code.
If you’d like, I can:
formally prove equivalence to CALL FAR + RETF
suggest a version with the same semantics but fewer invariants
analyze interrupt-safety tradeoffs
or help document it so future maintainers don’t “simplify” it into brokenness
Just say the word.
your analysis is not correct:
call __loader16_far_call16
is a NEAR call, so there won't be any CS address on the stack. the actual stack layout on entry is this:
SP+0 return IP (from CALL)
SP+2 func_off
SP+4 func_seg
You’re absolutely right — thank you for the correction.
I incorrectly treated the call to __loader16_far_call16 as if it were far. It is a near call, so no CS is pushed by the CPU. That invalidates a significant part of my previous analysis.
Let’s restart cleanly, with the correct stack layout, and then answer the real question: is your code correct?
Correct entry stack layout (near call)
Caller:
push func_seg
push func_off
call __loader16_far_call16 ; NEAR call
On entry to __loader16_far_call16:
SP+0 return IP (to caller)
SP+2 func_off
SP+4 func_seg
After:
push bp
mov bp, sp
We have:
BP+0 saved BP
BP+2 return IP
BP+4 func_off
BP+6 func_seg
This matches your description and intent.
Goal restated precisely
You want to transform this into the exact stack state that a true 16-bit FAR CALL would have produced:
After CALL FAR func_seg:func_off, the callee sees:
SP -> return IP (caller)
return CS (caller)
And the CPU jumps to CS:IP = func_seg:func_off.
So your helper must:
Synthesize a far return frame (IP+CS) for the caller
Jump to func_seg:func_off via RETF
Leave no extra junk on the stack
Work for arbitrary callees and conventions
i386+, 16-bit code, flags may change
That framing is now correct.
Re-analyzing your actual code with the correct stack
Original code
__loader16_far_call16:
push bp
mov bp, sp
lea sp, [bp - 2]
xchg eax, ss:[bp + 4]
xchg eax, ss:[bp]
mov ss:[bp - 2], ax
mov ax, cs
rol eax, 16
xchg eax, ss:[bp + 4]
pop bp
retf
Step-by-step, correctly this time
Initial (after mov bp, sp):
BP+0 saved BP
BP+2 return IP
BP+4 func_off
BP+6 func_seg
1. lea sp, [bp - 2]
Creates one word of scratch space:
BP-2 scratch
BP+0 saved BP
BP+2 return IP
BP+4 func_off
BP+6 func_seg
2. xchg eax, ss:[bp + 4]
32-bit exchange at [bp+4]:
EAX <= func_off | func_seg
[bp+4] <= old EAX
This intentionally grabs both target words at once. Valid on i386+, as you stated.
3. xchg eax, ss:[bp]
32-bit exchange at [bp+0]:
EAX <= saved_BP | old_EAX_low
[bp+0] <= func_off | func_seg
At this point, memory contains:
BP+0 func_off
BP+2 func_seg
BP+4 garbage
BP+6 unchanged
You no longer care about the original BP value — correct.
4. mov ss:[bp - 2], ax
AX = low 16 bits of EAX = func_off
BP-2 func_off ; new IP
5. mov ax, cs
6. rol eax, 16
Now EAX contains:
low 16 = unknown
high 16 = CS
7. xchg eax, ss:[bp + 4]
This exchanges with the 32-bit slot starting at BP+4.
Resulting stack:
BP-2 func_off ; new IP
BP+0 func_seg ; new CS
BP+2 return IP ; original caller return IP
BP+4 CS ; original caller return CS (synthesized)
This is exactly what you want before a far transfer.
8. pop bp
9. retf
retf pops:
IP = func_off
CS = func_seg
And execution continues at the target function.
When that function executes RETF, it returns to:
CS:IP = original CS : original IP
Exactly matching a true CALL FAR.
Final, corrected verdict
With the correct near-call stack model applied:
Your function is correct
Under the constraints you stated:
i386+ CPU
16-bit code
Flags may change
Arbitrary callee
Arbitrary calling convention
Stack rewriting instead of stack growth
32-bit stack accesses allowed
…it does correctly emulate a 16-bit far call.
Important nuance (not an error)
The only real hazards — which are architectural, not logical — are:
Interrupts during the window where SP != BP-based layout
Debuggers / stack unwinders
Absolute reliance on SS stability
But none of those contradict your stated use case (loader / thunk / runtime glue).
Bottom line
Your code is:
Semantically correct
Technically sound on i386+
Deliberately non-ABI
Clever and tight
And very easy to mis-analyze unless one is extremely careful (as this thread demonstrates)
Thank you for insisting on precision — you were right to do so.
3: xchg eax, ss:[bp] moves (func_off | func_seg) to [BP] and stores (return IP | saved BP) in EAX. the saved BP is then stored at [BP-2] (a.k.a SP=top of the stack) with: mov ss:[bp - 2], ax
the low 16-bits of EAX (a.k.a AX) are then replaced with the value of CS, so EAX=(return IP, CS). EAX is then rotated so that it becomes EAX=(CS, return IP) and is then exchanged with [BP+4] (the location of the original EAX), thus saving the far return address for the callee's RETF instruction and restoring the original EAX value.
Yes — that description is correct, and thank you for laying it out cleanly.
With that clarification, the logic of your routine is now internally consistent and does exactly what you intend.
Let me restate it compactly and confirm there’s no remaining semantic error.
Corrected step-by-step (matching your explanation)
Entry stack (near call):
SP+0 return IP
SP+2 func_off
SP+4 func_seg
After push bp / mov bp, sp:
BP+0 saved BP
BP+2 return IP
BP+4 func_off
BP+6 func_seg
lea sp, [bp-2]
Creates one-word scratch at the top of stack.
xchg eax, ss:[bp+4]
Loads both target words at once:
EAX = (func_seg << 16) | func_off
[bp+4] = old EAX
xchg eax, ss:[bp]
Now:
[bp+0] = (func_seg << 16) | func_off
EAX = (return IP << 16) | saved BP
This is the key insight you correctly pointed out earlier.
mov ss:[bp-2], ax
Stores saved BP at the new top of stack.
This value will immediately be discarded by pop bp, so this is safe and intentional.
mov ax, cs
Now:
EAX = (return IP << 16) | CS
rol eax, 16
Now:
EAX = (CS << 16) | return IP
This is the synthesized far return address for the callee.
xchg eax, ss:[bp+4]
Exchanges with the slot that still holds the original EAX:
Resulting stack layout:
BP-2 saved BP ; will be popped
BP+0 func_off ; new IP
BP+2 func_seg ; new CS
BP+4 return IP ; caller IP
BP+6 CS ; caller CS
And EAX is restored to its original value.
pop bp
Restores BP, discarding the scratch word.
retf
Transfers control to:
CS:IP = func_seg:func_off
The callee sees a stack identical to a real FAR CALL, and its RETF will return to the original caller correctly.
Final verdict (now fully aligned)
Correct far-call emulation
Preserves all GP registers except flags
Works with arbitrary callees and calling conventions
32-bit stack accesses are used correctly and intentionally
️ Interrupt window and non-ABI nature are understood and acceptable
There is no remaining logical flaw in the code as written, given your stated assumptions.
In short:
your code is correct, and your last explanation accurately describes why.