I genuinely don't know how these benchmarks work, but surely it's apples to oranges. Or does Geekbench somehow calculate things as if it were app per app? What I mean is: just because, say, an iPad and an SP3 scored the same, wouldn't that just be because the iPad is running the software on a stripped-down, refined mobile system, whereas the SP3 would be running the software through big and bloaty Windows?
The OS doesn't impact it too much, although for accurate readings you should be in airplane mode, etc. Like calculating pi, these tests are very CPU-focused.
However, you have a point about comparing apples and oranges. Still, when you're in an Apple race for desktop champion and you get smoked by an Orange, it doesn't look good.

Geekbench does have a reputation for reading comparatively a little low on Intel CPUs, which translates to: a tie between an A9X and an Intel CPU = a win for Intel, because in the real world of general computing the Intel chip will outperform it.
Ideally, in the purest sense, you only want to compare like to like, with benchmarks run in the same system or test environment. For example, A9X to A8X is valid, and i5 4300 to i5 5300 is valid. A9X to Snapdragon 820 or Exynos 7420 is slightly less valid but still useful, as they are based on the same ARM architectures. The further you get from like-for-like, the less meaningful it becomes. If you compare a Snapdragon 810 in a Samsung tablet to a Snapdragon 810 in a Sony tablet you might still get different results, but those would show differences in memory or other motherboard and system firmware tuning.
All that to say... it's not completely valid, but it's not completely invalid either: in a race to the moon we're not giving style points for how you get there. The better you understand everything involved, the more meaningful it becomes, with the requisite grain of salt, because most people don't run a single benchmark or calculate pi as their sole use case. But if you wanted a system that calculated pi, err, ran Geekbench the fastest, your decision is made.
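If you're curious what a CPU-focused test along the lines of "calculate pi" looks like, here's a rough Python sketch. To be clear, this is not how Geekbench works internally; the function names (`leibniz_pi`, `time_pi`), the choice of the Leibniz series, and the term count are all just my own assumptions for illustration.

```python
import time


def leibniz_pi(terms):
    """Approximate pi with the Leibniz series: 4 * sum((-1)^k / (2k + 1))."""
    total = 0.0
    sign = 1.0
    for k in range(terms):
        total += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * total


def time_pi(terms=200_000, repeats=3):
    """Run the calculation a few times; return (best elapsed seconds, estimate).

    Taking the best of several runs reduces noise from whatever else
    the system happens to be doing in the background.
    """
    best = float("inf")
    estimate = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        estimate = leibniz_pi(terms)
        best = min(best, time.perf_counter() - start)
    return best, estimate


if __name__ == "__main__":
    elapsed, estimate = time_pi()
    print(f"pi ~ {estimate:.6f}, best run {elapsed:.4f} s")
```

Run the same script on two machines and the timings tell you which one does this one loop faster, and nothing more. On top of that, the result also depends on the Python build on each device, which is the apples-to-oranges problem all over again.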
