Atomic operations: how does the hardware guarantee consistency, and how are they implemented in hardware?

As far as I know, an atomic instruction ensures that while it executes, no other thread can modify the data it operates on (much like a critical section). Is that correct?

But how is this implemented in hardware, and how does the hardware guarantee it? Does the hardware generate three micro-instructions internally (lock, modify, unlock)? What is the difference between using a mutex and using an atomic instruction? Is the only difference the number of instructions (one instruction for an atomic, several for an ordinary mutex)?

Does that difference in the number of instructions (one vs. many) still guarantee correctness and consistency, as a mutex does?

As for a mutex versus an atomic instruction: a mutex is, more or less, an agreement that one piece of memory will be updated atomically, so that one and only one thread at a time can set it to a specific state (locked). That lets atomic operations protect non-atomic ones. It is a protocol that all sides agree to follow, so that a longer sequence of operations behaves as if it were atomic even though the hardware cannot make it so on its own.
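To make that protocol concrete, here is a minimal sketch of a spinlock built on a single atomic flag, used to protect a plain, non-atomic counter. The names `SpinLock` and `count_with_spinlock` are hypothetical and exist only for this illustration. A real mutex would normally put waiting threads to sleep rather than spin.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock: one atomic flag is the "agreed-upon" state.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

public:
    void lock() {
        // test_and_set atomically sets the flag and returns its previous value.
        // The thread that sees "false" is the one that set it and owns the lock.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // Busy-wait until the current owner clears the flag.
        }
    }

    void unlock() {
        // Clearing the flag releases the lock and publishes the owner's writes.
        flag_.clear(std::memory_order_release);
    }
};

// Starts `threads` threads that each increment a plain long `iterations` times.
// The increment itself is not atomic; only the lock/unlock steps are.
long count_with_spinlock(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    SpinLock lock;
    long counter = 0;  // ordinary variable, protected only by the protocol
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // non-atomic read-modify-write, safe under the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

The correctness here depends on every thread following the same lock/unlock convention. A thread that touched `counter` without taking the lock would break it, which is what "agreement" means in this context.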

On the majority of modern CPUs, an atomic operation works by locking the affected memory location in the CPU's cache. The core acquires exclusive ownership of the cache line that holds the address and does not let any other core acquire or share that line until the operation completes.
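From the software side, this is what makes a single atomic read-modify-write sufficient without any lock. The sketch below uses `std::atomic` from the C++ standard library; the function names `count_with_atomic` and `atomic_store_max` are hypothetical, chosen for illustration. How the compiler lowers these calls depends on the target (on x86 an atomic add is typically one `lock`-prefixed instruction, while other architectures may use a load-linked/store-conditional retry loop), so treat it as an illustration rather than a statement about any particular CPU.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each thread increments a shared atomic counter with no mutex involved.
long count_with_atomic(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                // One indivisible read-modify-write; no increment is lost.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter.load();
}

// Compare-and-swap retry loop: raise `target` to `value` if `value` is larger.
// Operations with no dedicated atomic instruction are usually built this way.
void atomic_store_max(std::atomic<int>& target, int value) {
    int seen = target.load(std::memory_order_relaxed);
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
        // On failure compare_exchange_weak writes the current value into
        // `seen`, so the loop re-checks the condition and tries again.
    }
}
```

Unlike a mutex, nothing here ever blocks: each thread either completes its update or, in the compare-and-swap case, retries with the freshly observed value.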