To transform a point from world space to screen space, you run it through the camera projections, much like you do with any graphics you render. Say you start with a position in world space: you do WORLD -> VIEW -> PROJECTION, which leaves you with the position in clip space. If you then divide x, y and z by w (remember that clip space has four coordinates), you get NDC, or normalized device coordinates. You can then map these to whatever you want, e.g. pixels. This is how I do it in my engine:
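A minimal sketch of that chain in plain Python (not the actual engine code; the row-major 4x4 matrix layout, column-vector convention, and all names here are illustrative):

```python
def mat_mul_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def world_to_screen(pos_world, view, proj, screen_w, screen_h):
    # WORLD -> VIEW -> PROJECTION gives clip-space coordinates (x, y, z, w).
    clip = mat_mul_vec(proj, mat_mul_vec(view, pos_world + [1.0]))
    # Perspective divide: clip -> NDC, with x, y, z each in [-1, 1].
    ndc = [clip[i] / clip[3] for i in range(3)]
    # Map NDC to pixels; NDC y points up, pixel y usually points down.
    px = (ndc[0] * 0.5 + 0.5) * screen_w
    py = (1.0 - (ndc[1] * 0.5 + 0.5)) * screen_h
    return px, py
```

With identity view and projection matrices, a point at the origin lands in the middle of the screen, which is a quick sanity check for the NDC-to-pixel mapping.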

That gives you a ray in camera space (view space). You can then transform it into world space and cast it against your objects. How you do the ray casting is up to you (and depends on how the scene is organized), but to optimize a bit I first do a ray-AABB test against all objects and then a ray-triangle test once I hit an AABB. This might not be the most efficient way.
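For the ray-AABB step, a common choice is the "slab" test; a sketch of it follows (the function name and conventions are assumptions, not from the original post):

```python
def ray_aabb(origin, direction, bmin, bmax):
    """Return True if the ray hits the axis-aligned box [bmin, bmax]."""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        if abs(direction[axis]) < 1e-12:
            # Ray is parallel to this slab: it misses if the origin
            # lies outside the slab on this axis.
            if origin[axis] < bmin[axis] or origin[axis] > bmax[axis]:
                return False
        else:
            # Entry/exit distances for this pair of slab planes.
            inv = 1.0 / direction[axis]
            t1 = (bmin[axis] - origin[axis]) * inv
            t2 = (bmax[axis] - origin[axis]) * inv
            if t1 > t2:
                t1, t2 = t2, t1
            tmin = max(tmin, t1)
            tmax = min(tmax, t2)
            if tmin > tmax:
                return False
    return True
```

A hit requires the intervals from all three axes to overlap; a ray pointing away from the box fails because the overlap ends up entirely behind the origin.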

The same way you generate them when you draw stuff. For example, in XNA there's Matrix.CreateLookAt(), which you can use for view matrices, and Matrix.CreatePerspectiveFieldOfView() to create a projection matrix. The view matrix describes your camera's positioning and the projection matrix describes your camera's 'lens'. Somewhat simplified.
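For reference, here is a rough Python analogue of what those two XNA helpers compute, assuming a right-handed convention with row-major 4x4 matrices and column vectors (the function names are made up for illustration):

```python
import math

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def look_at(eye, target, up):
    """View matrix: camera z axis points from the target back toward the eye."""
    zaxis = _normalize([eye[i] - target[i] for i in range(3)])
    xaxis = _normalize(_cross(up, zaxis))
    yaxis = _cross(zaxis, xaxis)
    return [xaxis + [-_dot(xaxis, eye)],
            yaxis + [-_dot(yaxis, eye)],
            zaxis + [-_dot(zaxis, eye)],
            [0.0, 0.0, 0.0, 1.0]]

def perspective_fov(fov_y, aspect, z_near, z_far):
    """Projection matrix from a vertical field of view and aspect ratio."""
    y_scale = 1.0 / math.tan(fov_y / 2.0)
    x_scale = y_scale / aspect
    return [[x_scale, 0.0, 0.0, 0.0],
            [0.0, y_scale, 0.0, 0.0],
            [0.0, 0.0, z_far / (z_near - z_far),
             z_near * z_far / (z_near - z_far)],
            [0.0, 0.0, -1.0, 0.0]]
```

The '-1' in the last row of the projection matrix is what puts the view-space depth into w, so the perspective divide described earlier makes distant things smaller.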

Edit: If you don't know these already, chances are you're trying to learn picking before you've learned rendering. You should probably go the other way around: check some tutorials on 3D rendering first and make sure you understand the concepts of vertices, matrices, transformations, the pipeline, etc. Working with 3D is very different from working with 2D, so you shouldn't try to solve your problems the same way.