Profiling tools are a common way to understand computer systems and software and to achieve the best performance. Profiling becomes more important as computing technology advances, because it becomes more difficult to reason intuitively about system characteristics. However, the recent shift to multicore and heterogeneous systems requires new profiling methods that are better suited to the challenges of profiling multiple processing elements and multiple types of resources. In this dissertation, we focus on an important profiling problem for each of three application classes on modern hardware: multithreaded applications, multiprogrammed workloads, and heterogeneous systems.

For multithreaded applications, we aim to reduce the overhead of collecting a trace of application characteristics such as memory references. Lower overhead means less perturbation of the thread interleavings in a multithreaded application. We reduce the overhead by buffering gathered profile data in a dynamic binary instrumentation system, which decouples the collection of profile data from its processing. By controlling the code that is generated to fill the buffer and by using a variety of methods to empty it, we halve the overhead compared to the previous best implementation in the system.

For multiprogrammed workloads, we focus on profiling temperatures on multicore systems. We first isolate several issues that arise when profiling applications and temperatures on such workloads, including the effects of sampling frequency, how temperatures and applications are represented, and how multiple profiles are aligned. We then use our profile information to show that linear regression on application characteristics is unsuitable for modeling application temperatures, but that changes in application characteristics that indicate phase changes can be used to model temperature changes.
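The phase-change observation above can be illustrated with a small sketch. This is not the dissertation's method. The distance measure (mean relative change across metrics), the 0.25 threshold, and the function name `detect_phase_changes` are all hypothetical choices made for illustration. The sketch flags sampling intervals whose vector of application characteristics (for example, IPC and cache miss rate) differs sharply from the previous interval. Such intervals are candidate points at which to expect a temperature change.

```python
from typing import List, Sequence


def detect_phase_changes(samples: Sequence[Sequence[float]],
                         threshold: float = 0.25) -> List[int]:
    """Return indices i where interval i differs markedly from interval i-1.

    Each sample is one sampling interval's vector of application
    characteristics (hypothetical metrics, e.g. IPC and miss rate).
    The difference measure, mean relative change across metrics, is
    an illustrative assumption rather than a prescribed technique.
    """
    changes: List[int] = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        # Relative change per metric; guard against division by zero.
        rel = [abs(c - p) / max(abs(p), 1e-9) for p, c in zip(prev, cur)]
        if not rel:
            continue  # no metrics to compare in this interval
        if sum(rel) / len(rel) > threshold:
            changes.append(i)
    return changes


# Hypothetical per-interval samples: (IPC, L2 miss rate).
samples = [(1.2, 0.02), (1.25, 0.021), (0.6, 0.08), (0.62, 0.079)]
print(detect_phase_changes(samples))  # [2]
```

Comparing each interval only with its predecessor keeps the detector cheap enough to run online alongside sampling. A real analysis would still need to tune the threshold and choose metrics against measured temperature traces.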
We also analyze how heat spreads among cores and how different cores respond differently to thermal load.

Finally, we address collecting a full-system profile on heterogeneous systems. Existing profiling methods examine each resource in isolation without considering the interactions among multiple resources. We present, and then build upon, a straightforward method for combining profiles on heterogeneous systems that shows how CPU performance relates to GPU performance. We then analyze the potential impact of profiling on application performance. We also extend our method for combining profiles with a data flow pass that helps trace data processed by the GPU back to where it was originally generated on the CPU.

Our contributions include improved performance for multithreaded tracing, an analysis of novel approaches to predicting temperatures from application characteristics, and the first method to collect full-system profiles on heterogeneous systems. The insights gained can also be applied to other profiling tasks, enabling more tools to uncover more performance in modern and future computing systems.