Eric Schrock's Weblog
https://blogs.oracle.com/eschrock/
Reflections on OS integration
Copyright 2010. Feed generated Wed, 8 Dec 2010 14:35:41 +0000.

SES Sensors and Indicators
https://blogs.oracle.com/eschrock/entry/ses_sensors
Thu, 7 Aug 2008 08:32:39 +0000 (OpenSolaris)
<p>Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new <a href="http://blogs.sun.com/robj/entry/a_sensor_abstraction_layer_for">sensor abstraction layer</a> in libtopo. Rob did most of the hard work; my contribution consisted only of extending the <a href="http://blogs.sun.com/eschrock/entry/solaris_sensors_and_indicators">SES enumerator</a> to support the new facility infrastructure.</p>
<p>You can find a detailed description of the changes in the original FMA portfolio <a href="http://blogs.sun.com/eschrock/resource/ses_sensors.txt">here</a>, but it's much easier to understand via demonstration. This is the <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/fmtopo/">fmtopo</a> output for a fan node in a <a href="http://www.sun.com/storagetek/disk_systems/expansion/4400/">J4400 JBOD</a>:</p>
<pre>
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
label string Cooling Fan 0
FRU fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
group: authority version: 1 stability: Private/Private
product-id string SUN-Storage-J4400
chassis-id string 2029QTF0000000005
server-id string
group: ses version: 1 stability: Private/Private
node-id uint64 0x1f
target-path string /dev/es/ses3
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
group: authority version: 1 stability: Private/Private
product-id string SUN-Storage-J4400
chassis-id string 2029QTF0000000005
server-id string
group: facility version: 1 stability: Private/Private
type uint32 0x1 (LOCATE)
mode uint32 0x0 (OFF)
group: ses version: 1 stability: Private/Private
node-id uint64 0x1f
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
group: authority version: 1 stability: Private/Private
product-id string SUN-Storage-J4400
chassis-id string 2029QTF0000000005
server-id string
group: facility version: 1 stability: Private/Private
type uint32 0x0 (SERVICE)
mode uint32 0x0 (OFF)
group: ses version: 1 stability: Private/Private
node-id uint64 0x1f
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
group: authority version: 1 stability: Private/Private
product-id string SUN-Storage-J4400
chassis-id string 2029QTF0000000005
server-id string
group: facility version: 1 stability: Private/Private
sensor-class string threshold
type uint32 0x4 (FAN)
units uint32 0x12 (RPM)
reading double 3490.000000
state uint32 0x0 (0x00)
group: ses version: 1 stability: Private/Private
node-id uint64 0x1f
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
group: authority version: 1 stability: Private/Private
product-id string SUN-Storage-J4400
chassis-id string 2029QTF0000000005
server-id string
group: facility version: 1 stability: Private/Private
sensor-class string discrete
type uint32 0x103 (GENERIC_STATE)
state uint32 0x1 (DEASSERTED)
group: ses version: 1 stability: Private/Private
node-id uint64 0x1f
</pre>
<p>Here you can see the available indicators (locate and service), the fan speed (3490 RPM), and whether the fan is faulted. Right now this is just interesting data for savvy administrators to play with, as it's not used by any software. But that will change shortly, as we work on the next phases:</p>
<ul>
<li>Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it's in the system chassis or an external enclosure.</li>
<li>Generalization of the <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/disk-monitor/">disk-monitor</a> fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.</li>
<li>Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are against a "vdev" - a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it's possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.</li>
<li>Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.</li>
</ul>
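The sensor-monitoring phase described above boils down to a simple loop: read a facility node's value, compare it against a valid range, and emit telemetry when it falls outside. Here is a rough Python sketch of that idea; the function name, the ereport class string, and the thresholds are all invented for illustration and are not the fmd module's actual code:

```python
# Hypothetical sketch of threshold-sensor monitoring: poll a sensor and
# build an ereport-like record when the reading leaves its valid range.
# The function name, class string, and thresholds are all invented.

def check_threshold_sensor(node, reading, low, high):
    """Return an ereport-style dict if the reading is out of range."""
    if low <= reading <= high:
        return None  # healthy, nothing to report
    return {
        "class": "ereport.sensor.threshold-exceeded",
        "resource": node,
        "reading": reading,
        "range": (low, high),
    }

fan = "hc://.../ses-enclosure=1/fan=0?sensor=speed"
print(check_threshold_sensor(fan, 3490.0, low=1000.0, high=7000.0))  # None
report = check_threshold_sensor(fan, 0.0, low=1000.0, high=7000.0)
print(report["class"])  # ereport.sensor.threshold-exceeded
```

The real module would resolve the valid range from the sensor's own reported thresholds rather than hard-coding it.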
<p>Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.</p>

External storage enclosures in Solaris
https://blogs.oracle.com/eschrock/entry/external_storage_enclosures_in_solaris
Sun, 13 Jul 2008 21:23:39 +0000 (OpenSolaris)
<p>Over the past few years, I've been working on various parts of Solaris <a href="http://blogs.sun.com/eschrock/entry/solaris_platform_integration">platform integration</a>, with an emphasis on <a href="http://blogs.sun.com/eschrock/entry/solaris_platform_integration_generic_disk">disk monitoring</a>. While the majority of my time has been focused on fishworks, I have managed to implement a few more pieces of the original design.</p>
<p>About two months ago, I integrated the <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/scsi/libscsi/">libscsi</a> and <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/scsi/libses/">libses</a> libraries into Solaris Nevada. These libraries, originally written by Keith Wesolowski, form an abstraction layer upon which higher level software can be built. The modular nature of libses makes it easy to extend with vendor-specific support libraries in order to provide additional information and functionality not present in the SES standard, something difficult to do with the kernel-based ses(7d) driver. And since it is written in userland, it is easy to port to other operating systems. This library is used as part of the <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fwflash/">fwflash</a> firmware upgrade tool, and will be used in future Sun storage management products.</p>
<p>While libses itself is an interesting platform, its true raison d'être is to serve as the basis for enumeration of external enclosures as part of libtopo. Enumeration of components in a physically meaningful manner is a key component of the FMA strategy. These components form FMRIs (fault managed resource identifiers) that are the target of diagnoses. These FMRIs provide a way of not just identifying that "disk c1t0d0 is broken", but that this device is actually in bay 17 of the storage enclosure whose chassis serial number is "2029QTF0809QCK012". In order to do that effectively, we need a way to discover the physical topology of the enclosures connected to the system (chassis and bays) and correlate it with the in-band I/O view of the devices (SAS addresses). This is where SES (SCSI Enclosure Services) comes into play. SES processes show up as targets in the SAS fabric, and by using the additional element status descriptors, we can correlate physical bays with the attached devices under Solaris. In addition, we can also enumerate components not directly visible to Solaris, such as fans and power supplies.</p>
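These hc-scheme FMRIs have a regular shape: an authority section of colon-separated key=value pairs after "hc://", followed by a slash-separated path of name=instance components. A small Python sketch of that naming convention (this is only an illustration; the real libtopo parser is C and handles many more cases):

```python
# Illustrative parser for the hc-scheme FMRI strings printed by fmtopo.
# A sketch of the naming convention only, not the real libtopo code.

def parse_hc_fmri(fmri):
    """Split an hc FMRI into (authority dict, [(name, instance), ...])."""
    assert fmri.startswith("hc://")
    rest = fmri[len("hc://"):]
    authority_str, _, path_str = rest.partition("/")
    authority = {}
    for pair in authority_str.lstrip(":").split(":"):
        if pair:
            key, _, value = pair.partition("=")
            authority[key] = value
    path = []
    for comp in path_str.split("/"):
        name, _, inst = comp.partition("=")
        path.append((name, int(inst)))
    return authority, path

auth, path = parse_hc_fmri(
    "hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012"
    ":server-id=/ses-enclosure=0/bay=17/disk=0")
print(auth["chassis-id"])  # 2029QTF0809QCK012
print(path)                # [('ses-enclosure', 0), ('bay', 17), ('disk', 0)]
```

Note how the chassis serial number lives in the authority section while the physical location (enclosure, bay, disk) is carried by the path, which is exactly what makes "bay 17 of chassis 2029QTF0809QCK012" expressible.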
<p>The SES enumerator was integrated in build 93 of Nevada, and all of these components now show up in the libtopo hardware topology (commonly referred to as the "hc scheme"). To do this, we walk over all the SES targets visible to the system, grouping targets into logical chassis (something that is not as straightforward as it should be). We use this list of targets and a snapshot of the Solaris device tree to fill in which devices are present on the system. You can see the result by running <tt>fmtopo</tt> on a build 93 or later Solaris machine:</p>
<pre>
# /usr/lib/fm/fmd/fmtopo
...
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:serial=2029QTF0000000002:part=Storage-J4400:revision=3R13/ses-enclosure=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=2
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=3
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0386:part=375-3584-01/ses-enclosure=0/controller=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0074:part=375-3584-01/ses-enclosure=0/controller=1
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=1
...
</pre>
<p>To really get all the details, you can use the '-V' option to fmtopo to dump all available properties:</p>
<pre>
# fmtopo -V '*/ses-enclosure=0/bay=0/disk=0'
TIME UUID
Jul 14 03:54:23 3e95d95f-ce49-4a1b-a8be-b8d94a805ec8
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
ASRU fmri dev:///:devid=id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________//scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
label string SCSI Device 0
FRU fmri hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
group: authority version: 1 stability: Private/Private
product-id string SUN-Storage-J4400
chassis-id string 2029QTF0809QCK012
server-id string
group: io version: 1 stability: Private/Private
devfs-path string /scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
devid string id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________
phys-path string[] [ /pci@0,0/pci10de,377@a/pci1000,3150@0/disk@1c,0 /pci@0,0/pci10de,375@f/pci1000,3150@0/disk@1c,0 ]
group: storage version: 1 stability: Private/Private
logical-disk string c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
manufacturer string SEAGATE
model string ST37500NSSUN750G 0720A0PC3X
serial-number string 5QD0PC3X
firmware-revision string 3.AZK
capacity-in-bytes string 750156374016
</pre>
<p>So what does this mean, other than providing a way for you to finally figure out where disk 'c3t0d6' is really located? Currently, it allows the disks to be monitored by the <tt>disk-transport</tt> fmd module to generate faults based on predictive failure, over temperature, and self-test failure. The really interesting part is where we go from here. In the near future, thanks to work by Rob Johnston on the <a href="http://blogs.sun.com/eschrock/entry/solaris_sensors_and_indicators">sensor framework</a>, we'll be able to manage LEDs for disks that are part of external enclosures, diagnose failures of power supplies and fans, and read sensor data (such as fan speeds and temperature) as part of a unified framework.</p>
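The payoff of the topology snapshot is exactly that kind of lookup: given a logical disk name, find its physical bay label and serial number. A toy Python illustration, using a record abbreviated from the fmtopo -V output above (illustrative only; a real consumer would walk the libtopo snapshot in C):

```python
# Toy illustration of the lookup this topology enables: map a logical
# disk name to its physical bay label and serial number.  The record is
# abbreviated from the fmtopo -V output; paths are elided with "...".

topo = [
    {"fmri": ".../ses-enclosure=0/bay=0/disk=0",
     "label": "SCSI Device 0",
     "logical-disk": "c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0",
     "serial-number": "5QD0PC3X"},
]

def find_bay(nodes, logical_disk):
    """Return (label, serial) for the node backing logical_disk."""
    for node in nodes:
        if node.get("logical-disk") == logical_disk:
            return node["label"], node["serial-number"]
    return None

print(find_bay(topo, "c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0"))
# ('SCSI Device 0', '5QD0PC3X')
```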
<p>I often like to joke about the amount of time that I have spent just getting a single LED to light. At first glance, it seems like a pretty simple task. But to do it in a generic fashion that can be generalized across a wide variety of platforms, correlated with physically meaningful labels, and incorporate a diverse set of diagnoses (ZFS, SCSI, HBA, etc) requires an awful lot of work. Once it's all said and done, however, future platforms will require little to no integration work, and you'll be able to see a bad drive generate checksum errors in ZFS, resulting in an FMA diagnosis that indicates the faulty drive, activates a hot spare, and lights the fault LED on the drive bay (wherever it may be). Only then will we have accomplished our goal of an end-to-end storage strategy for Solaris - and hopefully someone besides me will know what it has taken to get that little LED to light.</p>

Solaris Sensors and Indicators
https://blogs.oracle.com/eschrock/entry/solaris_sensors_and_indicators
Sat, 9 Jun 2007 14:31:29 +0000 (OpenSolaris)
<p>For those of you who have been following my recent work with Solaris platform integration, be sure to check out the work Cindi and the FMA team are doing as part of the <a href="http://opensolaris.org/os/project/sensors/">Sensor Abstraction Layer</a> project. Cindi recently <a href="http://blogs.sun.com/cindi/entry/sensor_tivity">posted</a> an initial version of the <a href="http://opensolaris.org/os/project/sensors/design-01.pdf">Phase 1 design document</a>. Take a look if you're interested in the details, and <a href="http://www.opensolaris.org/jive/thread.jspa?threadID=32517">join the discussion</a> if you're interested in defining the Solaris platform experience.</p>
<p>The implications of this project for unified platform integration are obvious. With respect to what I've been working on, you'll likely see the current disk monitoring infrastructure converted into generic sensors, as well as the sfx4500-disk LED support converted into indicators. I plan to leverage this work as well as the SCSI FMA work to enable correlated ZFS diagnosis across internal and external storage.</p>

Solaris platform integration - disk monitoring
https://blogs.oracle.com/eschrock/entry/solaris_platform_integration_generic_disk
Sat, 26 May 2007 18:34:04 +0000 (OpenSolaris)
<p>Two weeks ago I putback PSARC 2007/202, the second step in <a href="http://blogs.sun.com/eschrock/entry/solaris_platform_integration">generalizing the x4500 disk monitor</a>. As explained in my previous blog post, one of the tasks of the original sfx4500-disk module was reading SMART data from disks and generating associated FMA faults. This platform-specific functionality needed to be generalized to effectively support future Sun platforms.</p>
<p>This putback did not add any new user-visible features to Solaris, but it did refactor the code in the following ways:</p>
<ul>
<li><p>A new private library, <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/fm/libdiskstatus/">libdiskstatus</a>, was added. This generic library uses uSCSI to read data from SCSI (or SATA via emulation) devices. It is not a generic SMART monitoring library, focusing only on the three generally available disk faults: over temperature, predictive failure, and self-test failure. There is a single function, <tt>disk_status_get()</tt>, that returns an nvlist describing the current parameters reported by the drive and whether any faults are present.</p></li>
<li><p>This library is used by the <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/fm/topo/modules/i86pc/sata/sata.c">SATA libtopo module</a> to export a generic <tt>TOPO_METH_DISK_STATUS</tt> method. This method keeps all the implementation details within libtopo and exports a generic interface for consumers.</p></li>
<li><p>A new fmd module, <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/disk-transport/">disk-transport</a>, periodically iterates over libtopo nodes and invokes the <tt>TOPO_METH_DISK_STATUS</tt> method on any supported nodes. The module generates FMA ereports for any detected errors.</p></li>
<li><p>These ereports are translated to faults by a simple <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/eversholt/files/common/disk.esc">eversholt DE</a>. These are the same faults that were originally generated by the sfx4500-disk module, so the code that consumes them remains unchanged.</p></li>
</ul>
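The pipeline in the list above (status library, transport module, diagnosis engine) can be mocked up in a few lines. This Python sketch is purely illustrative: the real code is C and passes nvlists, and the dictionary keys and class strings here are stand-ins, not the actual ereport or fault class names.

```python
# Rough mock-up of the pipeline above: a status check standing in for
# disk_status_get() feeds a transport loop that emits ereports, and a
# trivial "diagnosis engine" maps ereports to faults.  All names, keys,
# and class strings are stand-ins for illustration.

FAULT_CLASSES = {
    "over-temperature": "fault.io.disk.over-temperature",
    "predictive-failure": "fault.io.disk.predictive-failure",
    "self-test-failure": "fault.io.disk.self-test-failure",
}

def disk_status(smart):
    """Return the detected error names for one disk."""
    errors = []
    if smart["temp"] > smart["temp_threshold"]:
        errors.append("over-temperature")
    if smart["predictive_failure"]:
        errors.append("predictive-failure")
    if smart["selftest_failed"]:
        errors.append("self-test-failure")
    return errors

def transport_poll(disks):
    """One iteration of a disk-transport-style loop: one ereport per error."""
    return [{"class": "ereport.io.disk." + err, "disk": name}
            for name, smart in disks.items()
            for err in disk_status(smart)]

def diagnose(ereports):
    """Trivial diagnosis engine: map ereport classes to fault classes."""
    return [FAULT_CLASSES[e["class"].rsplit(".", 1)[-1]] for e in ereports]

disks = {"c1t0d0": {"temp": 71, "temp_threshold": 60,
                    "predictive_failure": False, "selftest_failed": False}}
print(diagnose(transport_poll(disks)))  # ['fault.io.disk.over-temperature']
```

The point of the separation is the same as in the real design: the status check knows nothing about fmd, the transport knows nothing about diagnosis, and the diagnosis rules (eversholt, in the real system) can change without touching either.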
<p>These changes form the foundation that will allow future Sun platforms to detect and react to disk failures, eliminating 5200 lines of platform-specific code in the process. The next major steps are currently in progress:</p>
<p>The FMA team, as part of the <a href="http://www.opensolaris.org/os/project/sensors/">sensor framework</a>, is expanding libtopo to include the ability to represent indicators (LEDs) in a generic fashion. This will replace the <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/fm/topo/modules/i86pc/sata/sfx4500_props.c">x4500 specific properties</a> and associated machinery with generic code.</p>
<p>The SCSI FMA team is finalizing the libtopo enumeration work that will allow arbitrary SCSI devices (not just SATA) to be enumerated under libtopo and therefore be monitored by the <tt>disk-transport</tt> module. The first phase will simply replicate the existing sfx4500-disk functionality, but will enable us to model future non-SATA platforms as well as external storage devices.</p>
<p>Finally, I am finishing up my long-overdue ZFS FMA work, a necessary step towards connecting ZFS and disk diagnosis. Stay tuned for more info.</p>

Solaris platform integration - libipmi
https://blogs.oracle.com/eschrock/entry/solaris_platform_integration
Sat, 17 Mar 2007 19:17:11 +0000 (OpenSolaris)
<p>As I continued down the path of improving various aspects of ZFS and Solaris platform integration, I found myself in the thumper (x4500) <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/i86pc/sfx4500-disk/">fmd platform module</a>. This module represents the latest attempt at Solaris platform integration, and an indication of where we are headed in the future.</p>
<p>When I say "platform integration", this is more involved than the platform support most people typically think of. The platform teams make sure that the system boots and that all the hardware is supported properly by Solaris (drivers, etc). Thanks to the FMA effort, platform teams must also deliver an FMA portfolio which covers FMA support for all the hardware and a unified serviceability plan. Unfortunately, there is still more work to be done beyond this, the most important being interacting with hardware in response to OS-visible events. This includes the ability to light LEDs in response to faults and device hotplug, as well as monitoring the service processor and keeping external FRU information up to date.</p>
<p>The sfx4500-disk module is the latest attempt at providing this functionality. It does the job, but is afflicted by the same problems that often plague platform integration attempts. It's overcomplicated, monolithic, and much of what it does should be generic Solaris functionality. Among the things this module does:</p>
<ul>
<li>Reads SMART data from disks and creates ereports</li>
<li>Diagnoses ereports into corresponding disk faults</li>
<li>Implements an IPMI interface directly on top of <tt>/dev/bmc</tt></li>
<li>Responds to disk faults by turning on the appropriate 'fault' disk LED</li>
<li>Listens for hotplug and DR events, updating the 'ok2rm' and 'present' LEDs</li>
<li>Updates SP-controlled FRU information</li>
<li>Monitors the service processor for resets and resyncs necessary information</li>
</ul>
<p>Needless to say, every single item on the above list is applicable to a wide variety of Sun platforms, not just the x4500, and it certainly doesn't need to be in a single monolithic module. This is not meant to be a slight against the authors of the module. As with most platform integration activities, this effort wasn't communicated by the hardware team until far too late, resulting in an unrealistic schedule with millions of dollars of revenue behind it. It doesn't help that all these features need to be supported on Solaris 10, making the schedule pressure all the more acute, since the code must soak in Nevada and then be backported in time for the product release. In these environments even the most fervent pleas for architectural purity tend to fall on deaf ears, and the engineers doing the work quickly find themselves between a rock and a hard place.</p>
<p>As I was wandering through this code and thinking about how this would interact with ZFS and future Sun products, it became clear that it needed a massive overhaul. More specifically, it needed to be burned to the ground and rebuilt as a set of distinct, general purpose, components. Since refactoring 12,000 lines of code with such a variety of different functions is non-trivial and difficult to test, I began by factoring out different pieces individually, redesigning the interfaces and re-integrating them into Solaris on a piece-by-piece basis.</p>
<p>Of all the functionality provided by the module, the easiest thing to separate was the IPMI logic. The <a href="http://www.intel.com/design/servers/ipmi/">Intelligent Platform Management Interface</a> is a specification for communicating with service processors to discover and control available hardware. Sadly, it's anything but "intelligent". If you had asked me a year ago what I'd be doing at the beginning of this year, I'm pretty sure that reading the IPMI specification would have been at the bottom of my list (right below driving stakes through my eyeballs). Thankfully, the IPMI functionality we needed was very small, and the best choice was a minimally functional private library, designed solely for the purpose of communicating with the service processor on supported Sun platforms. Existing libraries such as <a href="http://openipmi.sourceforge.net/">OpenIPMI</a> were too complicated, and in their efforts to present a generic abstracted interface, didn't provide what we really needed. The design goals are different, and the ON-private IPMI library and OpenIPMI will continue to develop and serve different purposes in the future.</p>
<p>Last week I finally integrated <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libipmi/">libipmi</a>. In the process, I eliminated 2,000 lines of platform-specific code and created a common interface that can be leveraged by other FMA efforts and future projects. It is provided for both x86 and SPARC, even though there are currently no supported SPARC machines with an IPMI-capable service processor (this is being worked on). This library is private and evolving quite rapidly, so don't use it in any non-ON software unless you're prepared to keep up with a changing API.</p>
<p>As part of this work, I also created a common fmd module, <a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/sp-monitor/">sp-monitor</a>, that monitors the service processor, if present, and generates a new <tt>ESC_PLATFORM_RESET</tt> sysevent to notify consumers when the service processor is reset. The existing sfx4500-disk module then consumes this sysevent instead of monitoring the service processor directly.</p>
<p>This is the first of many steps towards eliminating this module in its current form, as well as laying groundwork for future platform integration work. I'll post updates to this blog with information about generic disk monitoring, libtopo indicators, and generic hotplug management as I add this functionality. The eventual goal is to reduce the platform-specific portion of this module to a single .xml file delivered via libtopo that all these generic consumers will use to provide the same functionality that's present on the x4500 today. Only at this point can we start looking towards future applications, some of which I will describe in upcoming posts.</p>

DTrace sysevent provider
https://blogs.oracle.com/eschrock/entry/dtrace_sysevent_provider
Wed, 14 Mar 2007 21:18:54 +0000 (OpenSolaris)
<p>I've been heads down for a long time on a new project, but occasionally I do put something back to ON worth blogging about. Recently I've been working on some problems which leverage sysevents (<tt>libsysevent(3LIB)</tt>) as a common transport mechanism. While trying to understand exactly what sysevents were being generated from where, I found the lack of observability astounding. After poking around with DTrace, I found that tracking down the exact semantics was not exactly straightforward. First of all, we have two orthogonal sysevent mechanisms, the original <tt>syseventd</tt> legacy mechanism, and the more recent general purpose event channel (GPEC) mechanism, used by FMA. On top of this, the <tt>sysevent_impl_t</tt> structure isn't exactly straightforward, because all the data is packed together in a single block of memory. Knowing that this would be important for my upcoming work, I decided that adding a stable DTrace sysevent provider would be useful.</p>
<p>The provider has a single probe, <tt>sysevent:::post</tt>, which fires whenever a sysevent post attempt is made. It doesn't necessarily indicate that the sysevent was successfully queued or received. The probe has the following semantics:</p>
<pre>
# dtrace -lvP sysevent
ID PROVIDER MODULE FUNCTION NAME
44528 sysevent genunix queue_sysevent post
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: syseventchaninfo_t *
args[1]: syseventinfo_t *
</pre>
<p>The 'syseventchaninfo_t' translator has a single member, 'ec_name', which is the name of the event channel. If this is being posted via the legacy sysevent mechanism, then this member will be NULL. The 'syseventinfo_t' translator has three members, 'se_publisher', 'se_class', and 'se_subclass'. These mirror the arguments to <tt>sysevent_post()</tt>. The following script will dump all sysevents posted to <tt>syseventd(1M)</tt>:</p>
<pre>
#!/usr/sbin/dtrace -s
#pragma D option quiet
BEGIN
{
printf("%-30s %-20s %s\n", "PUBLISHER", "CLASS",
"SUBCLASS");
}
sysevent:::post
/args[0]->ec_name == NULL/
{
printf("%-30s %-20s %s\n", args[1]->se_publisher,
args[1]->se_class, args[1]->se_subclass);
}
</pre>
<p>And the output during a <tt>cfgadm -c unconfigure</tt>:</p>
<pre>
PUBLISHER CLASS SUBCLASS
SUNW:usr:devfsadmd:100237 EC_dev_remove disk
SUNW:usr:devfsadmd:100237 EC_dev_branch ESC_dev_branch_remove
SUNW:kern:ddi EC_devfs ESC_devfs_devi_remove
</pre>
<p>This has already proven quite useful in my ongoing work, and hopefully some other developers out there will also find it useful.</p>

First sponsored bugfix
https://blogs.oracle.com/eschrock/entry/first_sponsored_bugfix
Mon, 12 Sep 2005 21:15:32 +0000 (OpenSolaris)
<p>Yes, I am still here. And yes, I'm still working on ZFS as fast as I can. But I do have a small amount of free time, and managed to pitch in with some of the OpenSolaris bug sponsor efforts over at the <a href="http://www.opensolaris.org/jive/forum.jspa?forumID=27">request-sponsor</a> forum. I figure I can handle a bug every week or two even with ZFS in full "end game" swing, and hopefully inspire others to jump on the sponsor bandwagon during this interim period. In particular, I helped Rich Lowe integrate <a href="http://www.opensolaris.org/jive/thread.jspa?threadID=1746&tstart=0">two basic code cleanup fixes</a> into the Nevada gate. Nothing spectacular, but worthy of a proof of concept, and it adds another name to the list of contributors who have had fixes putback into Nevada. Next week I'll try to grab one of the remaining bugfixes to lend a hand. Maybe someday I'll have enough time to blog for real, but don't expect much until ZFS is back in the gate.</p>
<p>Also, check out the <a href="http://www.opensolaris.org/os/community/onnv/putbacks_20050909.html">Nevada putback logs</a> for build 22. Very cool stuff - kudos to Steve and the rest of the OpenSolaris team. Pay attention to the fixes contributed by Shawn Walker and Jeremy Teo - it's nice to see active work being done, despite the fact that we still have so much work left to do in building an effective community.</p>
<p>Technorati Tag: <a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a></p>

Fame, Glory, and a free iPod shuffle
https://blogs.oracle.com/eschrock/entry/fame_glory_and_a_free
Thu, 11 Aug 2005 09:46:47 +0000 (OpenSolaris)
<p>Thanks to Jarod Jenson (of <a href="http://www.sun.com/software/solaris/javaone_challenge.jsp">DTrace</a> and <a href="http://www.aeysis.com/">Aeysis</a> fame), I now have a shiny new 512MB iPod shuffle to give away to a worthy OpenSolaris cause. Back when I posted my original <a href="http://blogs.sun.com/roller/page/eschrock?entry=a_parting_mdb_challenge">MDB challenge</a>, I had no cool stuff to entice potential suitors. So now I'll offer this iPod shuffle to the first person who submits an acceptable solution to the problem <b>and</b> follows through to integrate the code into OpenSolaris (I will sponsor any such RFE). Send your diffs against the latest OpenSolaris source to me at <i>eric dot schrock at sun dot com</i>. We'll put a time limit of, say, a month and a half (until 10/1) so that I can safely recycle the iPod shuffle into another challenge should no one respond.</p>
<p>Once again, the original challenge is <a href="http://blogs.sun.com/roller/page/eschrock?entry=a_parting_mdb_challenge">here</a>.</p>
<p>So besides the fame and glory of integrating the first non-bite size RFE into OpenSolaris, you'll also walk away with a cool toy. Not to mention all the MDB knowledge you'll have under your belt. Feel free to email me questions, or head over to the <a href="http://www.opensolaris.org/jive/forum.jspa?forumID=4">mdb-discuss</a> forum. Good Luck!</p>
<p class="tag">Tags: <a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a> <a href="http://technorati.com/tag/MDB" rel="tag">MDB</a></p>https://blogs.oracle.com/eschrock/entry/where_have_i_beenWhere have I been?eschrockhttps://blogs.oracle.com/eschrock/entry/where_have_i_been
Mon, 8 Aug 2005 10:00:45 +0000OpenSolaris<p>It's been almost a month since my last blog post, so I thought I'd post an update. I spent the month of July in Massachusetts, alternately on vacation, working remotely, and attending my brother's wedding. The rest of the LAE (Linux Application Environment) team joined me (and Nils) for a week out there, and we made some huge progress on the project. For the curious, we're working on how best to leverage OpenSolaris to help the project and the community, at which point we can go into more details about what the final product will look like. Until then, suffice to say "we're working on it". All this time on LAE did prevent me from spending time with my other girlfriend, ZFS. Since getting back, I've caught up with most of the ZFS work in my queue, and the team has made huge progress on ZFS in my absence. As much as I'd like to talk about details (or a schedule), I can't :-( But trust me, you'll know when ZFS integrates into Nevada; there are many bloggers who will not be so quiet when that putback notice comes by. Not to mention that the source code will hit <a href="http://www.opensolaris.org">OpenSolaris</a> shortly thereafter.</p>
<p>Tomorrow I'll be up at LinuxWorld, hanging out at the booth with <a href="http://www.cuddletech.com/blog/">Ben</a> and hosting the <a href="http://www.linuxworldexpo.com/live/12/events/12SFO05A/conference/tracksessions//QMONYA04O6SS">OpenSolaris BOF</a> along with <a href="http://blogs.sun.com/ahl">Adam</a> and <a href="http://blogs.sun.com/bmc">Bryan</a> (<a href="http://blogs.sun.com/dp">Dan</a> will be there as well, though he didn't make the "official" billing). Whether you know nothing about OpenSolaris or are one of our dedicated community members, come check it out.</p>https://blogs.oracle.com/eschrock/entry/operating_system_tunablesOperating system tunableseschrockhttps://blogs.oracle.com/eschrock/entry/operating_system_tunables
Tue, 12 Jul 2005 17:04:46 +0000OpenSolaris<p>There's an interesting discussion over at <a href="http://www.opensolaris.org/jive/thread.jspa?threadID=1153&tstart=0">opensolaris-code</a>, spawned from an initial request to add some tunables to Solaris /proc. This exposes a few very important philosophical differences between Solaris and other operating systems out there. I encourage you to read the thread in its entirety, but here's an executive summary:</p>
<ul>
<li><p><b>When possible, the system should be auto-tuning</b> - If you are creating a tunable to control fine-grained behavior of your program or operating system, you should first ask yourself: "Why does this tunable exist? Why can't I just pick the best value?" More often than not, you'll find the answer is "Because I'm lazy" or "The problem is too hard." Only in rare circumstances is there ever a definite need for a tunable, and even then it should almost always control coarse on-off behavior.</p></li>
<li><p><b>If a tunable is necessary, it should be as specific as possible</b> - The days of dumping every tunable under the sun into <tt>/etc/system</tt> are over. Very rarely do tunables need to be system wide. Most tunables should be per process, per connection, or per filesystem. We are continually converting our old system-wide tunables into per-object controls.</p></li>
<li><p><b>Tunables should be controlled by a well defined interface</b> - <tt>/etc/system</tt> and <tt>/proc</tt> are not your personal landfills. <tt>/etc/system</tt> is by nature undocumented, and designing it as your primary interface is fundamentally wrong. <tt>/proc</tt> is well documented, but it's also well defined to be a <i>process filesystem</i>. Besides the enormous breakage you'd introduce by adding <tt>/proc/tunables</tt>, it's philosophically wrong. The <tt>/system</tt> directory is a slightly better choice, but it's intended primarily for observability of subsystems that translate well to a hierarchical layout. In general, we don't view filesystems as a primary administrative interface, but as a programmatic API upon which more sophisticated tools can be built.</p></li>
</ul>
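<p>The per-object principle can be illustrated with the portable <tt>setrlimit(2)</tt> interface (my own illustration, not from the thread): the knob is scoped to one process rather than dumped into a system-wide file, which is the same idea Solaris resource controls generalize.</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;sys/resource.h&gt;

/*
 * Lower this process's soft limit on open file descriptors and
 * return the new value.  The change affects only this process --
 * the per-object scoping argued for above -- unlike a global
 * /etc/system tunable.
 */
static rlim_t
set_fd_limit(rlim_t n)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return ((rlim_t)-1);
	rl.rlim_cur = n;	/* soft limit; hard limit untouched */
	if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
		return ((rlim_t)-1);
	(void) getrlimit(RLIMIT_NOFILE, &rl);
	return (rl.rlim_cur);
}

int
main(void)
{
	printf("soft fd limit now %llu\n",
	    (unsigned long long)set_fd_limit(64));
	return (0);
}
</pre>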
<p>One of the best examples of these principles can be seen in the updated System V IPC tunables. <a href="http://blogs.sun.com/dep">Dave Powell</a> rewrote this arcane set of /etc/system tunables during the course of Solaris 10. Many of the tunables were made auto-tuning, and those that couldn't be were converted into resource controls administered on a per process basis using standard Solaris administrative tools. Hopefully Dave will blog at some point about this process, the decisions he made, and why.</p>
<p>There are, of course, always going to be exceptions to the above rules. We still have far too many documented /etc/system tunables in Solaris today, and there will always be some that are absolutely necessary. But our philosophy is focused around these principles, as illustrated by the following story from the discussion thread:</p>
<p><i>Indeed, one of the more amusing stories was a Platinum Beta customer
showing us some slideware from a certain company comparing their OS
against Solaris. The slides were discussing available tunables, and the
basic gist was something like:</i></p>
<p><i>"We used to have way fewer tunables than Solaris, but now we've caught
up and have many more than they do. Our OS rules!"</i></p>
<p>Needless to say, we thought the company was missing the point.</p>
<p class="tag">Tags: <a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a></p>https://blogs.oracle.com/eschrock/entry/a_parting_mdb_challengeA parting MDB challengeeschrockhttps://blogs.oracle.com/eschrock/entry/a_parting_mdb_challenge
Fri, 1 Jul 2005 13:18:16 +0000OpenSolaris<p>Like most of Sun's US employees, I'll be taking the next week off for vacation. On top of that, I'll be back in my hometown in MA for the next few weeks, alternately working remotely and attending my brother's wedding. I'll leave you with an MDB challenge, this time much more involved than past "puzzles". I don't have any prizes lying around, but this one would certainly be worth one if I had anything to give.</p>
<p>So what's the task? To implement <a href="http://www.opensolaris.org/os/community/mdb/tips/">munges</a> as a dcmd. Here's the complete description:</p>
<p>Implement a new dcmd, <tt>::stacklist</tt>, that will walk all threads (or all threads within a specific process when given a proc_t address) and summarize the different stacks by frequency. By default, it should display output identical to 'munges':</p>
<pre>
&gt; ::stacklist
73 ################################## tp: fffffe800000bc80
swtch+0xdf()
cv_wait+0x6a()
taskq_thread+0x1ef()
thread_start+8()
38 ################################## tp: ffffffff82b21880
swtch+0xdf()
cv_wait_sig_swap_core+0x177()
cv_wait_sig_swap+0xb()
cv_waituntil_sig+0xd7()
lwp_park+0x1b1()
syslwp_park+0x4e()
sys_syscall32+0x1ff()
...
</pre>
<p>The first number is the frequency of the given stack, and the 'tp' pointer should be a representative thread of the group. The stacks should be organized by frequency, with the most frequent ones first. When given the '-v' option, the dcmd should print out <i>all</i> threads containing the given stack trace. For extra credit, the ability to walk all threads with a matching stack (<tt>addr::walk samestack</tt>) would be nice.</p>
<p>This is not an easy dcmd to write, at least when doing it correctly. The first key is to use as little memory as possible. This dcmd must be capable of being run within kmdb(1M), where we have limited memory available. The second key is to leverage existing MDB functionality without duplicating code. You should not be copying code from <tt>::findstack</tt> or <tt>::stack</tt> into your dcmd. Ideally, you should be able to invoke <tt>::findstack</tt> without worrying about its inner workings. Alternatively, restructuring the code to share a common routine would also be acceptable.</p>
<p>This command would be hugely beneficial when examining system hangs or other "soft failures," where there is no obvious culprit (such as a panicking thread). Having this functionality in KMDB (where we cannot invoke 'munges') would make debugging a whole class of problems much easier. This is also a great RFE to get started with OpenSolaris. It is self contained, low risk, but non-trivial, and gets you familiar with MDB at the same time. Personally, I have always found the <a href="http://www.opensolaris.org/os/community/observability">observability tools</a> a great place to start working on Solaris, because the risk is low while still requiring (hence learning) internal knowledge of the kernel.</p>
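<p>For a feel of the bookkeeping involved, here is a rough userland sketch of the aggregation step such a dcmd would need - entirely my own strawman, with fabricated thread pointers and frame names, and none of the real MDB walker/dcmd API:</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

#define	MAX_GROUPS	16

/* One bucket of threads sharing an identical stack trace. */
typedef struct {
	const char *tp;		/* representative thread pointer */
	const char *stack;	/* frames joined with '\n', used as the key */
	int count;		/* number of threads with this stack */
} group_t;

static group_t groups[MAX_GROUPS];
static int ngroups;

/* Record one thread's stack, coalescing identical traces. */
static void
record_stack(const char *tp, const char *stack)
{
	int i;

	for (i = 0; i &lt; ngroups; i++) {
		if (strcmp(groups[i].stack, stack) == 0) {
			groups[i].count++;
			return;
		}
	}
	if (ngroups &lt; MAX_GROUPS) {
		groups[ngroups].tp = tp;	/* first thread seen wins */
		groups[ngroups].stack = stack;
		groups[ngroups].count = 1;
		ngroups++;
	}
}

/* Sort buckets by descending frequency, as munges does. */
static int
cmp_freq(const void *a, const void *b)
{
	return (((const group_t *)b)->count - ((const group_t *)a)->count);
}

int
main(void)
{
	/* Fabricated threads: three parked in taskq_thread, one elsewhere. */
	record_stack("fffffe800000bc80", "swtch\ncv_wait\ntaskq_thread");
	record_stack("fffffe800000bd00", "swtch\ncv_wait\ntaskq_thread");
	record_stack("fffffe800000be80", "swtch\ncv_wait\ntaskq_thread");
	record_stack("ffffffff82b21880", "swtch\nlwp_park\nsyslwp_park");

	qsort(groups, ngroups, sizeof (group_t), cmp_freq);
	for (int i = 0; i &lt; ngroups; i++)
		printf("%d tp: %s\n%s\n\n", groups[i].count,
		    groups[i].tp, groups[i].stack);
	return (0);
}
</pre>
<p>The real dcmd would key buckets off a hash of the return addresses rather than strings, precisely to keep memory use small enough for kmdb.</p>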
<p>If you do manage to write this dcmd, please email me (Eric dot Schrock at sun dot com) and I will gladly be your sponsor to get it integrated into OpenSolaris. I might even be able to dig up a prize somewhere...</p>https://blogs.oracle.com/eschrock/entry/solaris_virtualizationVirtualization and OpenSolariseschrockhttps://blogs.oracle.com/eschrock/entry/solaris_virtualization
Sun, 26 Jun 2005 13:38:14 +0000OpenSolaris<p>There's actually a decent piece over at <a href="http://www.eweek.com/article2/0,1759,1830116,00.asp">eWeek</a> discussing the future of Xen and LAE (the project formerly known as Janus) on OpenSolaris. Now that our marketing folks are getting the right message out there about what we're trying to accomplish, I thought I'd follow up with a little technical background on virtualization and why we're investing in these different technologies. Keep in mind that these are my personal beliefs based on interactions with customers and other Solaris engineers. Any resemblance to a corporate strategy is purely coincidental ;-)</p>
<p>Before diving in, I should point out that this will be a rather broad coverage of virtualization strategies. For a more detailed comparison of Zones and Jails in particular, check out James Dickens' <a href="http://www.karrot-x.net/jamesd/jailVzone.html">Zones comparison chart</a>.</p>
<h3>Benefits of Virtualization</h3>
<p>First off, virtualization is here to stay. Our customers need virtualization - it dramatically reduces the cost of deploying and maintaining multiple machines and applications. The success of companies such as <a href="http://www.vmware.com">VMWare</a> is proof enough that such a market exists, though we have been hearing it from our customers for a long time. What we find, however, is that customers are often confused about exactly what they're trying to accomplish, and companies try to pitch a single solution to virtualization problems without recognizing that more appropriate solutions may exist. The most common need for virtualization (as judged by our customer base) is application consolidation. Many of the larger apps have become so complex that they become a system in themselves - and often they don't play nicely with other applications on the box. So "one app per machine" has become the common paradigm. The second most common need is security, either for your application administrators or your developers. Other reasons certainly exist (rapid test environment deployment, distributed system simulation, etc), but these are the two primary ones.</p>
<p>So what does virtualization buy you? It's all about reducing costs, but there are really two types of cost associated with running a system:</p>
<ol>
<li><b>Hardware costs</b> - This includes the cost of the machine, but also the costs associated with running that machine (power, A/C).</li>
<li><b>Software management costs</b> - This includes the cost of deploying new machines, upgrading/patching software, and observing software behavior.</li>
</ol>
<p>As we'll see, different virtualization strategies provide different qualities of the above savings.</p>
<h3>Hardware virtualization</h3>
<p>One of the most well-established forms of virtualization, the most common examples today are <a href="http://docs.sun.com/source/806-3509-12/2__Domains.html#pgfId-539801">Sun Domains</a> and <a href="http://expertanswercenter.techtarget.com/eac/knowledgebaseAnswer/0,295199,sid63_gci970997,00.html">IBM Logical Partitions</a>. In each case, the hardware is responsible for dividing existing resources in such a way as to present multiple machines to the user. This has the advantage of requiring no software layer, no performance impact, and hardware fault isolation. The downside to this is that it requires specialized hardware that is extremely expensive, and provides zero benefit for reducing software management costs.</p>
<h3>Software machine virtualization</h3>
<p>This approach is probably the one most commonly associated with the term
"virtualization". In this scheme, a software layer is created which allows
multiple OS instances to run on the same hardware. The most commercialized
versions are <a href="http://www.vmware.com">VMware</a> and <a
href="http://www.microsoft.com/windows/virtualpc/default.mspx">Virtual PC</a>,
but other projects exist (such as <a
href="http://fabrice.bellard.free.fr/qemu/">qemu</a> and <a
href="http://pearpc.sourceforge.net/">PearPC</a>). Typically, they require a
"host" operating system as well as multiple "guests" (although VMware ESX server
runs a custom kernel as the host). While <a
href="http://www.cl.cam.ac.uk/Research/SRG/netos/xen/">Xen</a> uses a
paravirtualization technique that requires changes to the guest OS, it is still
fundamentally a machine virtualization technique. And <a
href="http://user-mode-linux.sourceforge.net/">Usermode Linux</a> takes a
radically different approach, but accomplishes the same basic task.</p>
<p>In the end, this approach has similar strengths and weaknesses as the hardware assisted
virtualization. You don't have to buy expensive special-purpose hardware, but
you give up the hardware fault isolation and often sacrifice performance (Xen's
approach lessens this impact, but it's still visible). But most importantly, you
still don't save any costs associated with software management - administering
software on 10 virtual machines is just as expensive as administering 10
separate machines. And you have no visibility into what's happening within the
virtual machine - you may be able to tell that Xen is consuming 50% of your CPU,
but you can't tell why unless you log into the virtual system itself.</p>
<h3>Software application virtualization</h3>
<p>On the grand scale of virtualization, this ranks as the "least virtualized".
With this approach, the operating system uses various tricks and techniques to
present an alternate view of the machine. This can range from simple
<tt>chroot(1)</tt>, to <a
href="http://www.awprofessional.com/articles/article.asp?p=366888&seqNum=9&rl=1">BSD
Jails</a>, to <a href="http://www.sun.com/bigadmin/content/zones">Solaris
Zones</a>. Each of these provides a more complete OS view with varying degrees
of isolation. While Zones is the most complete and the most secure, they all
use the same fundamental idea of a single operating system presenting an
"alternate reality" that appears to be a complete system at the application
level. The upcoming Linux Application Environment on OpenSolaris will take this
approach by leveraging Zones and emulating Linux at the system call layer.</p>
<p>The most significant downside to this approach is the fact that there is a single kernel. You cannot run different operating systems (though LAE will add an interesting twist), and the "guest" environments have limited access to hardware facilities. On the other hand, this approach results in <i>huge</i> savings on the software management front. Because applications are still processes within the host environment, you have total visibility into what is happening within each guest, using standard operating system tools, and can manage guest applications as you would any other processes, using standard resource management tools. You can deploy, patch, and upgrade software from a single point without having to physically log into each machine. While not all applications will run in such a reduced environment, those that do will be able to benefit from vastly simplified software management. This approach also has the added bonus that it tends to make better use of shared resources. In Zones, for example, the most common configuration includes a shared /usr directory, so that no additional disk space is needed (and only one copy of each library needs to be resident in memory).</p>
<h3>OpenSolaris virtualization in the future</h3>
<p>So what does this all mean for OpenSolaris? Why are we continuing to pursue Zones, LAE, and Xen? The short answer is because "our customers want us to." And hopefully, from what's been said above, it's obvious that there is no one virtualization strategy that is correct for everyone. If you want to consolidate servers running a variety of different operating systems (including older versions of Solaris), then Xen is probably the right approach. If you want to consolidate machines running Solaris applications, then Zones is probably your best bet. If you require the ability to survive hardware faults between virtual machines, then domains is the only choice. If you want to take advantage of Solaris FMA and performance, but still want to run the latest and greatest from RedHat with support, then Xen is your option. If you have 90% of your applications on Solaris, and you're just missing that one last app, then LAE is for you. Similarly, if you have a Linux app that you want to debug with DTrace, you can leverage LAE without having to port to Solaris first.</p>
<p>With respect to Linux virtualization in particular, we are <b>always</b> going to pursue ISV certification first. No one at Sun wants you to run Oracle under LAE or Xen. Given the choice, we will always aggressively pursue ISVs to do a native port to Solaris. But we understand that there is an entire ecosystem of applications (typically in-house apps) that just won't run on Solaris x86. We want users to have a choice between virtualization options, and we want all those options to be a fundamental part of the operating system.</p>
<p>I hope that helps clear up the grand strategy. There will always be people who disagree with this vision, but we honestly believe we're making the best choices for our customers.</p>
<p class="tag">Tags: <a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a>
<a href="http://technorati.com/tag/Zones" rel="tag">Zones</a></p>
<hr/>
<p>You may note that I failed to mention cross-architecture virtualization. This is most common at the system level (like PearPC), but application-level solutions do exist (including Apple's upcoming Rosetta). This type of virtualization simply doesn't factor into our plans yet, and still falls under the umbrella of one of the broad virtualization types.</p>
<p>I also apologize for any virtualization projects out there that I missed. There are undoubtedly many more, but the ones mentioned above serve to illustrate my point.</p>
https://blogs.oracle.com/eschrock/entry/fun_source_code_factsFun source code factseschrockhttps://blogs.oracle.com/eschrock/entry/fun_source_code_facts
Sat, 25 Jun 2005 12:04:30 +0000OpenSolaris<p>A while ago, for my own amusement, I went through the Solaris source base and searched for the source files with the most lines. For some unknown reason this popped in my head yesterday so I decided to try it again. Here are the top 10 longest files in OpenSolaris:</p>
<table>
<tr><th>Length</th><th>Source File</th></tr>
<tr><td>29944</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/scsi/targets/sd.c">usr/src/uts/common/io/scsi/targets/sd.c</a></td></tr>
<tr><td>25920</td><td><font color="#cccccc">[closed]</font></td></tr>
<tr><td>25429</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/inet/tcp/tcp.c">usr/src/uts/common/inet/tcp/tcp.c</a></td></tr>
<tr><td>22789</td><td><font color="#cccccc">[closed]</font></td></tr>
<tr><td>16954</td><td><font color="#cccccc">[closed]</font></td></tr>
<tr><td>16339</td><td><font color="#cccccc">[closed]</font></td></tr>
<tr><td>15667</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/fs/nfs/nfs4_vnops.c">usr/src/uts/common/fs/nfs/nfs4_vnops.c</a></td></tr>
<tr><td>14550</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/sfmmu/vm/hat_sfmmu.c">usr/src/uts/sfmmu/vm/hat_sfmmu.c</a></td></tr>
<tr><td>13931</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/dtrace/dtrace.c">usr/src/uts/common/dtrace/dtrace.c</a></td></tr>
<tr><td>13027</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/sun4u/starfire/io/idn_proto.c">usr/src/uts/sun4u/starfire/io/idn_proto.c</a></td></tr>
</table>
<p>You can see some of the largest files are still closed source. Note that the length of a file doesn't necessarily indicate anything about the quality of the code; it's more just idle curiosity. Knowing the quality of online journalism these days, I'm sure this will get turned into "Solaris source reveals completely unmaintainable code" ...</p>
<p>After looking at this, I decided a much more interesting question was "which source files are the most commented?" To answer this question, I ran every source file through a script I found that counts the number of commented lines in each file. I filtered out those files that were less than 500 lines long, and ran the results through another script to calculate the percentage of lines that were commented. Lines which have a comment along with source are considered a commented line, so some of the ratios were quite high. I filtered out those files which were mostly tables (like <a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/uwidth.c">uwidth.c</a>), as these comments didn't really count. I also ignored header files, because they tend to be far more commented than the implementation itself. In the end I had the following list:</p>
<table><tr><th>Percentage</th><th>File</th></tr>
<tr><td>62.9%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/cmd/cmd-inet/usr.lib/mipagent/snmp_stub.c">usr/src/cmd/cmd-inet/usr.lib/mipagent/snmp_stub.c</a></td></tr>
<tr><td>58.7%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/cmd/sgs/libld/amd64/amd64unwind.c">usr/src/cmd/sgs/libld/amd64/amd64unwind.c</a></td></tr>
<tr><td>58.4%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/lib/libtecla/common/expand.c">usr/src/lib/libtecla/common/expand.c</a></td></tr>
<tr><td>56.7%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/cmd/lvm/metassist/common/volume_nvpair.c">usr/src/cmd/lvm/metassist/common/volume_nvpair.c</a></td></tr>
<tr><td>56.6%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/lib/libtecla/common/cplfile.c">usr/src/lib/libtecla/common/cplfile.c</a></td></tr>
<tr><td>55.6%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/lib/libc/port/gen/mon.c">usr/src/lib/libc/port/gen/mon.c</a></td></tr>
<tr><td>55.4%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/lib/libadm/common/devreserv.c">usr/src/lib/libadm/common/devreserv.c</a></td></tr>
<tr><td>55.1%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/lib/libtecla/common/getline.c">usr/src/lib/libtecla/common/getline.c</a></td></tr>
<tr><td>54.5%</td><td><font color="#cccccc">[closed]</font></td></tr>
<tr><td>54.3%</td><td><a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/io/ib/ibtl/ibtl_mem.c">usr/src/uts/common/io/ib/ibtl/ibtl_mem.c</a></td></tr>
</table>
<p>Now, when I write code I tend to hover in the 20-30% comments range (my best of those in the gate is <a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/fs/gfs.c">gfs.c</a>, which with <a href="http://blogs.sun.com/dep">Dave's</a> help is 44% comments). Some of the above are rather over-commented (especially snmp_stub.c, which likes to repeat comments above and within functions).</p>
<p>I found this little experiment interesting, but please don't base any conclusions on these results. They are for entertainment purposes only.</p>
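<p>For the curious, here is roughly what such a measurement might look like - my own sketch, not the actual script I used: any line containing part of a block comment counts as commented, including mixed code-and-comment lines, which is why some ratios above climb past 50%.</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/*
 * Return the percentage of lines in a C source buffer that contain
 * (part of) a block comment.  A rough sketch: it ignores // comments
 * and comment markers inside string literals.
 */
static int
comment_pct(const char *src)
{
	int total = 0, commented = 0, in_comment = 0;
	const char *line = src;

	while (*line != '\0') {
		const char *nl = strchr(line, '\n');
		size_t len = nl ? (size_t)(nl - line) : strlen(line);
		int has_comment = in_comment;	/* continuation lines count */

		for (size_t i = 0; i &lt; len; i++) {
			if (!in_comment && i + 1 &lt; len &&
			    line[i] == '/' && line[i + 1] == '*') {
				in_comment = 1;
				has_comment = 1;
			} else if (in_comment && i + 1 &lt; len &&
			    line[i] == '*' && line[i + 1] == '/') {
				in_comment = 0;
				i++;	/* skip past the closing '/' */
			}
		}
		total++;
		if (has_comment)
			commented++;
		line = nl ? nl + 1 : line + len;
	}
	return (total ? commented * 100 / total : 0);
}

int
main(void)
{
	const char *sample =
	    "/* header */\n"
	    "int x;\n"
	    "int y; /* inline */\n"
	    "int z;\n";

	/* Two of four lines touch a comment. */
	printf("%d%% commented\n", comment_pct(sample));
	return (0);
}
</pre>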
<p>Technorati Tag: <a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a></p> https://blogs.oracle.com/eschrock/entry/mdb_puzzle_take_twoMDB puzzle, take twoeschrockhttps://blogs.oracle.com/eschrock/entry/mdb_puzzle_take_two
Fri, 24 Jun 2005 01:11:38 +0000OpenSolaris<p>Since Bryan solved my <a href="http://blogs.sun.com/roller/page/eschrock?entry=mdb_puzzle">last puzzle</a> a little too quickly, this post will serve as a followup puzzle that may or may not be easier. All I know is that Bryan is ineligible this time around ;-)</p>
<p>Once again, the rules are simple. The solution must be a single line dcmd that produces precise output without any additional steps or post processing. For this puzzle, you're actually allowed two commands: one for your dcmd, and another for '::run'. For this puzzle, we'll be using the following test program:</p>
<pre>
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;
#include &lt;unistd.h&gt;

int
main(int argc, char **argv)
{
	int i;

	srand(time(NULL));
	for (i = 0; i &lt; 100; i++)
		write(rand() % 10, NULL, 0);
	return (0);
}
</pre>
<p>The puzzle itself demonstrates how conditional breakpoints can be implemented on top of existing functionality:</p>
<p><b>Stop the test program on entry to the write() system call only when the file descriptor number is 7</b></p>
<p>I thought this one would be harder than the last, but now I'm not so sure, especially once you absorb some of the finer points from the last post.</p>
<p>Technorati Tag: <a href="http://technorati.com/tag/mdb" rel="tag">MDB</a></p>https://blogs.oracle.com/eschrock/entry/mdb_puzzleMDB puzzleeschrockhttps://blogs.oracle.com/eschrock/entry/mdb_puzzle
Thu, 23 Jun 2005 22:57:17 +0000OpenSolaris<p>On a lighter note, I thought I'd post an "MDB puzzle" for the truly masochistic out there. I was going to post two, but the second one was just way too hard, and I was having a hard time finding a good test case in userland. You can check out how we hope to make this better over at the <a href="http://www.opensolaris.org/os/community/mdb/future">MDB community</a>. Unfortunately I don't have anything cool to give away, other than my blessing as a truly elite MDB hacker. Of course, if you get this one right I might just have to post the second one I had in mind...</p>
<p>The rules are simple. You can only use a single line command in 'mdb -k'. You cannot use shell escapes (!). Your answer must be precise, without requiring post-processing through some other utility. Leaders of the <a href="http://www.opensolaris.org/os/community/mdb">MDB community</a> and their relatives are ineligible, though other Sun employees are welcome to try. And now, the puzzle:</p>
<p><b>Print out the current working directory of every process with an effective user id of 0.</b></p>
<p>Should be simple, right? Well, make sure you go home and study your MDB pipelines, because you'll need some clever tricks to get this one just right...</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a> <a href="http://technorati.com/tag/mdb" rel="tag">MDB</a></p> https://blogs.oracle.com/eschrock/entry/adding_a_module_to_opensolarisAdding a kernel module to OpenSolariseschrockhttps://blogs.oracle.com/eschrock/entry/adding_a_module_to_opensolaris
Sun, 19 Jun 2005 14:01:18 +0000OpenSolaris<p>On opening day, I chose to post an entry on <a
href="http://blogs.sun.com/roller/page/eschrock/20050614#how_to_add_a_system">adding
a system call</a> to OpenSolaris. Considering the feedback, I thought I'd
continue with brief "How-To add &lt;thing&gt; to OpenSolaris" documents for a while.
There's a lot to choose from here, so I'll just pick them off as quick as I can.
Today's topic is adding a new kernel module to OpenSolaris.</p>
<p>For the sake of discussion, we will be adding a new module that does nothing
apart from print a message on load and unload. It will be architecture-neutral,
and be distributed as part of a separate package (to give you a taste of our
packaging system). We'll continue my narcissistic
tradition and name this the "schrock" module.</p>
<h3>1. Adding source</h3>
<p>To begin, you must put your source somewhere in the tree. It must be put
somewhere under <a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common">usr/src/uts/common</a>,
but exactly where depends on the type of module. Just about the only firm rule
is that filesystems go in the "fs" directory; beyond that, placement is largely
convention. The bulk of the modules live in the "io" directory, since the
majority of modules are drivers of some kind. For now, we'll put 'schrock.c' in
the "io" directory:</p>
<pre>
#include &lt;sys/modctl.h&gt;
#include &lt;sys/cmn_err.h&gt;

static struct modldrv modldrv = {
	&mod_miscops,
	"schrock module %I%",
	NULL
};

static struct modlinkage modlinkage = {
	MODREV_1, (void *)&modldrv, NULL
};

int
_init(void)
{
	cmn_err(CE_WARN, "OpenSolaris has arrived");
	return (mod_install(&modlinkage));
}

int
_fini(void)
{
	cmn_err(CE_WARN, "OpenSolaris has left the building");
	return (mod_remove(&modlinkage));
}

int
_info(struct modinfo *modinfop)
{
	return (mod_info(&modlinkage, modinfop));
}
</pre>
<p>The code is pretty simple, and is basically the minimum needed to add
a module to the system. You'll notice we use 'mod_miscops' in our modldrv.
If we were adding a device driver or filesystem, we would use a
different set of linkage structures.</p>
<h3>2. Creating Makefiles</h3>
<p>We must add two Makefiles to get this building:</p>
<pre>
usr/src/uts/intel/schrock/Makefile
usr/src/uts/sparc/schrock/Makefile
</pre>
<p>With contents similar to the following:</p>
<pre>
UTSBASE = ../..
MODULE = schrock
OBJECTS = $(SCHROCK_OBJS:%=$(OBJS_DIR)/%)
LINTS = $(SCHROCK_OBJS:%.o=$(LINTS_DIR)/%.ln)
ROOTMODULE = $(ROOT_MISC_DIR)/$(MODULE)
include $(UTSBASE)/intel/Makefile.intel
ALL_TARGET = $(BINARY)
LINT_TARGET = $(MODULE).lint
INSTALL_TARGET = $(BINARY) $(ROOTMODULE)
CFLAGS += $(CCVERBOSE)
.KEEP_STATE:
def: $(DEF_DEPS)
all: $(ALL_DEPS)
clean: $(CLEAN_DEPS)
clobber: $(CLOBBER_DEPS)
lint: $(LINT_DEPS)
modlintlib: $(MODLINTLIB_DEPS)
clean.lint: $(CLEAN_LINT_DEPS)
install: $(INSTALL_DEPS)
include $(UTSBASE)/intel/Makefile.targ
</pre>
<h3>3. Modifying existing Makefiles</h3>
<p>There are two remaining Makefile chores before we can continue. First, we have
to add the set of files to <a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/Makefile.files">usr/src/uts/common/Makefile.files</a>:</p>
<pre>
KMDB_OBJS += kdrv.o
<b>SCHROCK_OBJS += schrock.o</b>
BGE_OBJS += bge_main.o bge_chip.o bge_kstats.o bge_log.o bge_ndd.o \
	bge_atomic.o bge_mii.o bge_send.o bge_recv.o
</pre>
<p>If you had created a subdirectory for your module instead of placing it in
"io", you would also have to add a set of rules to <a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/Makefile.rules">usr/src/uts/common/Makefile.rules</a>.
If you need to do this, make sure you get both the object targets <i>and</i> the
lint targets, or you'll get build failures if you try to run lint.</p>
<p>You'll also need to modify the <a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/Makefile.intel">usr/src/uts/intel/Makefile.intel</a>
file, as well as the corresponding SPARC version:</p>
<pre>
MISC_KMODS += usba usba10
MISC_KMODS += zmod
<b>MISC_KMODS += schrock</b>
#
# Software Cryptographic Providers (/kernel/crypto):
#
</pre>
<h3>4. Creating a package</h3>
<p>As mentioned previously, we want this module to live in its own package. We
start by creating <tt>usr/src/pkgdefs/SUNWschrock</tt> and adding it to the list
of COMMON_SUBDIRS in <a
href="http://cvs.opensolaris.org/source/xref/usr/src/pkgdefs/Makefile">usr/src/pkgdefs/Makefile</a>:</p>
<pre>
SUNWsasnm \
SUNWsbp2 \
<b> SUNWschrock \</b>
SUNWscpr \
SUNWscpu \
</pre>
<p>Next, we have to add a skeleton package. Since we're only adding a
miscellaneous module and not a full-blown driver, we only need a simple
skeleton. First, there's the Makefile:</p>
<pre>
include ../Makefile.com
.KEEP_STATE:
all: $(FILES)
install: all pkg
include ../Makefile.targ
</pre>
<p>A 'pkginfo.tmpl' file:</p>
<pre>
PKG=SUNWschrock
NAME="Sample kernel module"
ARCH="ISA"
VERSION="ONVERS,REV=0.0.0"
SUNW_PRODNAME="SunOS"
SUNW_PRODVERS="RELEASE/VERSION"
SUNW_PKGVERS="1.0"
SUNW_PKGTYPE="root"
MAXINST="1000"
CATEGORY="system"
VENDOR="Sun Microsystems, Inc."
DESC="Sample kernel module"
CLASSES="none"
HOTLINE="Please contact your local service provider"
EMAIL=""
BASEDIR=/
SUNW_PKG_ALLZONES="true"
SUNW_PKG_HOLLOW="true"
</pre>
<p>And 'prototype_com', 'prototype_i386', and 'prototype_sparc' (elided) files:</p>
<pre>
# prototype_i386
!include prototype_com
d none kernel/misc/amd64 755 root sys
f none kernel/misc/amd64/schrock 755 root sys
</pre>
<pre>
# prototype_com
i pkginfo
d none kernel 755 root sys
d none kernel/misc 755 root sys
f none kernel/misc/schrock 755 root sys
</pre>
<h3>5. Putting it all together</h3>
<p>If we pkgadd our package, or BFU to the resulting archives, we can see our
module in action:</p>
<pre>
halcyon# modload /kernel/misc/schrock
Jun 19 12:43:35 halcyon schrock: WARNING: OpenSolaris has arrived
halcyon# modunload -i 197
Jun 19 12:43:50 halcyon schrock: WARNING: OpenSolaris has left the building
</pre>
<p>This process is common to all kernel modules (though packaging is simpler for
those combined in SUNWckr, for example). Things get a little more complicated
and a little more specific when you begin to talk about drivers or filesystems
in particular. I'll try to create some simple howtos for those as
well.</p>
<p>Technorati Tag: <a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a></p>
<h3>Observability in OpenSolaris</h3>
<p><i>Fri, 17 Jun 2005 17:39:06 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/observability_in_opensolaris">permalink</a>)</p>
<p>Just a heads up that we've formed a new OpenSolaris <a href="http://www.opensolaris.org/os/community/observability">Observability community</a>. There's not much there right now, but I encourage you to head over and check out what OpenSolaris has to offer. Or come to the discussion forum and gripe about what features we're still missing. Topics covered include process, system, hardware, and post-mortem observability. We'll be adding much more content as soon as we can.</p>
<p>Technorati Tag: <a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a></p>
<h3>GDB to MDB Migration, Part Two</h3>
<p><i>Fri, 17 Jun 2005 10:22:46 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/gdb_to_mdb_migration_part">permalink</a>)</p>
<p>Talking to <a href="http://www.cuddletech.com/">Ben</a> last night convinced me I needed to finish up the <a href="http://blogs.sun.com/roller/page/eschrock/20050510#gdb_to_mdb">GDB to MDB reference</a> that I started last month. So here's part two.</p>
<table border=0 cellpadding=5>
<tr><td width=30/><th width=130 align=left>GDB</th><th width=130
align=left>MDB</th><th align=left>Description</th></tr>
<tr><td colspan=4><h3>Program Stack</h3></td></tr>
<tr><td/><td valign=top><tt>backtrace <i>n</i></tt></td><td
valign=top><tt>::stack<br/>$C</tt></td>
<td valign=top>Display stack backtrace for the current thread</td></tr>
<tr><td/><td valign=top>-</td><td valign=top><tt><i>thread</i>::findstack
-v</tt></td>
<td valign=top>Display a stack for a given thread. In the kernel, <i>thread</i>
is the address of the <tt>kthread_t</tt>. In userland, it's the thread
identifier.</td></tr>
<tr><td/><td valign=top><tt>info ...</tt></td><td valign=top><tt>-</tt></td>
<td valign=top>Display information about the current frame. MDB doesn't support
the debugging data necessary to maintain the frame abstraction.</td></tr>
<tr><td colspan=4><h3>Execution Control</h3></td></tr>
<tr><td/><td valign=top><tt>continue<br/>c</tt></td><td valign=top><tt>:c</tt></td>
<td valign=top>Continue target.</td></tr>
<tr><td/><td valign=top><tt>stepi<br/>si</tt></td><td valign=top><tt>::step<br/>]</tt></td>
<td valign=top>Step to the next machine instruction. MDB does not support
stepping by source lines.</td></tr>
<tr><td/><td valign=top><tt>nexti<br/>ni</tt></td><td valign=top><tt>::step
over<br/>[</tt></td>
<td valign=top>Step over the next machine instruction, skipping any function
calls.</td></tr>
<tr><td/><td valign=top><tt>finish</tt></td><td valign=top><tt>::step
out</tt></td>
<td valign=top>Continue until returning from the current frame.</td></tr>
<tr><td/><td valign=top><tt>jump <i>*address</i></tt></td><td
valign=top><tt><i>address</i>&gt;<i>reg</i></tt></td>
<td valign=top>Jump to the given location. In MDB, <i>reg</i> depends on your
platform: for SPARC it's 'pc', for i386 it's 'eip', and for amd64 it's
'rip'.</td></tr>
<tr><td colspan=4><h3>Display</h3></td></tr>
<tr><td/><td valign=top><tt>print <i>expr</i></tt></td><td
valign=top><tt><i>addr</i>::print
<i>expr</i></tt></td>
<td valign=top>Print the given expression. In GDB you can specify variable
names as well as addresses. For MDB, you give a particular address and then
specify the type to display (which can include dereferencing of members,
etc).</td></tr>
<tr><td/><td valign=top><tt>print /<i>f</i></tt></td><td
valign=top><tt>addr/<i>f</i></tt></td>
<td valign=top>Print data in a precise format. See <tt>::formats</tt> for a
list of MDB formats.</td></tr>
<tr><td/><td valign=top><tt>disassem <i>addr</i></tt></td><td
valign=top><tt><i>addr</i>::dis</tt></td>
<td valign=top>Disassemble text at the given address, or the current PC if no
address is specified.</td></tr>
</table>
<p>This is just a primer. Both programs support a wide variety of additional
options. Running 'mdb -k', you can quickly see just how many commands are out
there:</p>
<pre>
> ::dcmds ! wc -l
385
> ::walkers ! wc -l
436
</pre>
<p>One helpful trick is <tt>::dcmds ! grep <i>thing</i></tt>, which searches the
description of each command. Good luck, and join the discussion over at the
OpenSolaris <a href="http://www.opensolaris.org/os/community/mdb">MDB
community</a> if you have any questions or tips of your own.</p>
<p>
Technorati tag: <a href="http://technorati.com/tag/MDB" rel="tag">MDB</a><br/>
Technorati tag: <a href="http://technorati.com/tag/OpenSolaris"
rel="tag">OpenSolaris</a><br/>
Technorati tag: <a href="http://technorati.com/tag/Solaris"
rel="tag">Solaris</a>
</p>
<h3>How to add a system call to OpenSolaris</h3>
<p><i>Tue, 14 Jun 2005 08:03:00 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/how_to_add_a_system">permalink</a>)</p>
<p>When I first started in the Solaris group, I was faced with two equally
difficult tasks: learning the development model, and understanding the source
code. For both these tasks, the recommended method is usually picking a small
bug and working through the process. For the curious, the first bug I putback
to ON was <a
href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4912227">4912227</a>
(ptree call returns zero on failure), a simple bug with near zero risk. It
was the first step down a very long road.</p>
<p>As another first step, someone suggested adding a very simple system call to the
kernel. This turned out to be a whole lot harder than one would expect, and has
so many subtle aspects that experienced Solaris engineers (myself included)
<i>still</i> miss some of the necessary changes. With that in mind, I thought a
reasonable first OpenSolaris blog would be describing exactly how to add a new
system call to the kernel.</p>
<p>For the purposes of this post, we will assume that it's a simple system call
that lives in the generic kernel code, and we'll put the code into an existing
file to avoid having to deal with Makefiles. The goal is to print an arbitrary
message to the console whenever the system call is issued.</p>
<h3>1. Picking a syscall number</h3>
<p>Before writing any real code, we first have to pick a number that will
represent our system call. The main source of documentation here is
<a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/syscall.h">syscall.h</a>,
which describes all the available system call numbers, as well as which ones are
reserved. The maximum number of syscalls is currently 256 (NSYSCALL), which
doesn't leave much space for new ones. This could theoretically be extended - I
believe the hard limit is in the size of <tt>sysset_t</tt>, whose 16 integers
must be able to represent a complete bitmask of all system calls. This puts our
actual limit at 16*32, or 512, system calls. But for the purposes of our
tutorial, we'll pick system call number 56, which is currently unused. For my
own amusement, we'll name our (my?) system call 'schrock'. So first we add the
following line to <tt>syscall.h</tt>:</p>
<p>
<pre>
#define SYS_uadmin 55
<b>#define SYS_schrock 56</b>
#define SYS_utssys 57
</pre>
</p>
<h3>2. Writing the syscall handler</h3>
<p>Next, we have to actually add the function that will get called when we
invoke the system call. What we should really do is add a new file
<tt>schrock.c</tt> to <a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/syscall">usr/src/uts/common/syscall</a>,
but I'm trying to avoid Makefiles. Instead, we'll just stick it in <a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/syscall/getpid.c">getpid.c</a>:</p>
<p>
<pre>
#include &lt;sys/cmn_err.h&gt;

int
schrock(void *arg)
{
        char buf[1024];
        size_t len;

        if (copyinstr(arg, buf, sizeof (buf), &amp;len) != 0)
                return (set_errno(EFAULT));

        cmn_err(CE_WARN, "%s", buf);
        return (0);
}
</pre>
</p>
<p>Note that declaring a buffer of 1024 bytes on the stack is a <i>very</i> bad
thing to do in the kernel. We have limited stack space, and a stack overflow
will result in a panic. We also don't check that the length of the string was
less than our scratch space. But this will suffice for illustrative purposes.
The <a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/printf.c#cmn_err">cmn_err()</a>
function is the simplest way to display messages from the kernel.</p>
<h3>3. Adding an entry to the syscall table</h3>
<p>We need to place an entry in the system call table. This table lives in <a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/sysent.c">sysent.c</a>,
and makes heavy use of macros to simplify the source. Our system call takes a
single argument and returns an integer, so we'll need to use the
<tt>SYSENT_CI</tt> macro. We need
to add a prototype for our syscall, and add an entry to the <tt>sysent</tt> and
<tt>sysent32</tt> tables:</p>
<p>
<pre>
int rename();
void rexit();
<b>int schrock();</b>
int semsys();
int setgid();
<i>/* ... */</i>
	/* 54 */ SYSENT_CI("ioctl", ioctl, 3),
	/* 55 */ SYSENT_CI("uadmin", uadmin, 3),
<b>	/* 56 */ SYSENT_CI("schrock", schrock, 1),</b>
	/* 57 */ IF_LP64(
			SYSENT_2CI("utssys", utssys64, 4),
			SYSENT_2CI("utssys", utssys32, 4)),
<i>/* ... */</i>
	/* 54 */ SYSENT_CI("ioctl", ioctl, 3),
	/* 55 */ SYSENT_CI("uadmin", uadmin, 3),
<b>	/* 56 */ SYSENT_CI("schrock", schrock, 1),</b>
	/* 57 */ SYSENT_2CI("utssys", utssys32, 4),
</pre>
</p>
<h3>4. /etc/name_to_sysnum</h3>
<p>At this point, we could write a program to invoke our system call, but the
point here is to illustrate <i>everything</i> that needs to be done to integrate
a system call, so we can't ignore the little things. One of these little things
is <tt>/etc/name_to_sysnum</tt>, which provides a mapping between system call
names and numbers, and is used by <tt>dtrace(1M)</tt>, <tt>truss(1)</tt>, and
friends. Of course, there is one version for x86 and one for SPARC, so you will
have to add the following lines to both the
<a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/intel/os/name_to_sysnum">intel</a>
and
<a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/sparc/os/name_to_sysnum">SPARC</a>
versions:</p>
<p>
<pre>
ioctl 54
uadmin 55
<b>schrock 56</b>
utssys 57
fdsync 58
</pre>
</p>
<h3>5. truss(1)</h3>
<p>Truss does fancy decoding of system call arguments. In order to do this, we
need to maintain a table in truss that describes the type of each argument for
every syscall. This table is found in <a
href="http://cvs.opensolaris.org/source/xref/usr/src/cmd/truss/systable.c">systable.c</a>.
Since our syscall takes a single string, we add the following entry:</p>
<p>
<pre>
{"ioctl", 3, DEC, NOV, DEC, IOC, IOA}, /* 54 */
{"uadmin", 3, DEC, NOV, DEC, DEC, DEC}, /* 55 */
<b>{"schrock", 1, DEC, NOV, STG}, /* 56 */</b>
{"utssys", 4, DEC, NOV, HEX, DEC, UTS, HEX}, /* 57 */
{"fdsync", 2, DEC, NOV, DEC, FFG}, /* 58 */
</pre>
</p>
<p>Don't worry too much about the different constants. But be sure to read up
on the truss source code if you're adding a complicated system call.</p>
<h3>6. proc_names.c</h3>
<p>This is the file that gets missed the most often when adding a new syscall.
Libproc uses the table in <a
href="http://cvs.opensolaris.org/source/xref/usr/src/lib/libproc/common/proc_names.c">proc_names.c</a>
to translate between system call numbers and names. Why it doesn't make use of
<tt>/etc/name_to_sysnum</tt> is anybody's guess, but for now you have to update
the <tt>systable</tt> array in this file:
</p>
<p>
<pre>
"ioctl", /* 54 */
"uadmin", /* 55 */
<b> "schrock", /* 56 */</b>
"utssys", /* 57 */
"fdsync", /* 58 */
</pre>
</p>
<h3>7. Putting it all together</h3>
<p>Finally, everything is in place. We can test our system call with a simple
program:</p>
<p>
<pre>
#include &lt;sys/syscall.h&gt;
int
main(int argc, char **argv)
{
        syscall(SYS_schrock, "OpenSolaris Rules!");
        return (0);
}
</pre>
</p>
<p>If we run this on our system, we'll see the following output on the
console:</p>
<p>
<pre>
Jun 14 13:42:21 halcyon genunix: WARNING: OpenSolaris Rules!
</pre>
</p>
<p>Because we did all the extra work, we can actually observe the behavior using
<tt>truss(1)</tt>, <tt>mdb(1)</tt>, or <tt>dtrace(1M)</tt>. As you can see,
adding a system call is not as easy as it should be. One of the ideas that has
been floating around for a while is the Grand Unified Syscall(tm) project, which
would centralize all this information as well as provide type information for
the DTrace syscall provider. But until that happens, we'll have to deal with
this process.</p>
<p>
Technorati Tag: <a
href="http://www.technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a>
<br>
Technorati Tag: <a
href="http://www.technorati.com/tag/Solaris" rel="tag">Solaris</a>
</p>
<h3>FISL Final Day</h3>
<p><i>Sat, 4 Jun 2005 18:51:07 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/fisl_final_day">permalink</a>)</p>
<p>The last day of <a href="http://fisl.softwarelivre.org">FISL</a> has come and gone, thankfully. I'm completely drained, both physically and mentally. As you can probably tell from the comments on yesterday's blog entry, we had quite a night out last night in Porto Alegre. I didn't stay out quite as late as some of the Brazil guys, but Ken and I made it back in time to catch about 4 hours of sleep before heading off to the conference. Thankfully I remembered to set my alarm; otherwise I probably would have ended up in bed until the early afternoon. The full details of the night are better told in person...</p>
<p>This last day was significantly quieter than previous days. With the conference winding down, I assume that many people took off early. Most of our presentations today were to an audience of 2 or 3 people, and we even had to cancel some of the early ones as no one was there. I managed to give presentations for Performance, Zones, and DTrace, despite my complete lack of sleep. The DTrace presentation was particularly rough because it's primarily demo-driven, with no set plan. This turns out to be rather difficult after a night of no sleep and a few too many <a href="http://www.maria-brazil.org/caipirinha.htm">caipirinhas</a>.</p>
<p>The highlight of the day was when a woman (stunningly beautiful, of course) came up to me while I was sitting in one of the chairs and asked to take a picture with me. We didn't talk at all, and I didn't know who she was, but she seemed psyched to be getting her picture taken with someone from Sun. I just keep telling myself that it was my stunning good looks that resulted in the picture, not my badge saying "Sun Microsystems". I can dream, can't I?</p>
<p>Tomorrow begins the 24 hours of travelling to get me back home. I can't wait to get back to my own apartment and a normal lifestyle.</p>
<h3>FISL Day 3</h3>
<p><i>Fri, 3 Jun 2005 17:13:04 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/fisl_day_3">permalink</a>)</p>
<p>The exhaustion continues to increase. Today I did 3 presentations: DTrace, Zones, and FMA (which turned into OpenSolaris). Every one took up the full hour allotted. And tomorrow I'm going to add a Solaris performance presentation, to bring the grand total to 4 hours of presentations. Given how bad the acoustics are on the exposition floor, my goal is to lose my voice by the end of the night. So far, I've settled into a schedule: wake up around 7:00, check email, work on slides, eat breakfast, then get to the conference around 8:45. After a full day of talking and giving presentations, I get back to the hotel around 7:45 and do about an hour of work/email before going out to dinner. We get back from dinner around 11:30, at which point I get to blogging and finishing up some work. Eventually I get to sleep around 1:00, and then I have to do the whole thing again the next day. Thank god tomorrow is the end; I don't know how much more I can take.</p>
<p>Today's highlight was when Dimas (from Sun Brazil) began an impromptu <a href="http://www.sun.com/software/looking_glass/">Looking Glass</a> demo towards the end of the day. He ended up overflowing our booth with at least 40 people for a solid hour before the commotion started to die down. Those of us sitting in the corner were worried we'd have to leave to make room. Our Solaris presentations hit 25 or so people, but never so many for so long. The combination of cool eye candy and a native Portuguese speaker really helped out (though most people probably couldn't hear him anyway).</p>
<p>Other highlights included hanging out with the folks at <a href="http://www.codebreakers.com.br">CodeBreakers</a>, who really seem to dig Solaris (Thiago had S10 installed on his laptop within half a day). We took some pictures with them (which Dave should post soon), and are going out for barbeque and drinks tonight with them and 100+ other open source Brazil folks. I also helped a few other people get Solaris 10 installed on their laptops (mostly just the "disable USB legacy support" problem). It's unbelievably cool to see the results of handing out Solaris 10 DVDs before even leaving the conference. The top Solaris presentations were understandably DTrace and Zones, though the booth was pretty well packed all day.</p>
<p>Let's hope the last day is as good as the rest. Here's to <i>Software Livre!</i></p>
<h3>FISL Day 2</h3>
<p><i>Thu, 2 Jun 2005 15:44:07 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/fisl_day_2">permalink</a>)</p>
<p>Another day at <a href="http://fisl.softwarelivre.org">FISL</a>, another day full of presentations. Today we did mini-presentations every hour on the hour, most of which were very well attended. When we overlapped with the major keynote sessions, turnout tended to be low, but other than that it was very successful. We covered OpenSolaris, DTrace, FMA, SMF, and Security, as well as a Java presentation (by Charlie, not Dave or myself). As usual, lots of great questions from the highly technical audience.</p>
<p>The highlight today was a great conversation with a group of folks very interested in starting an OpenSolaris users group in Brazil. Extremely nice group of guys, very interested in technology and helping OpenSolaris build a greater presence in Brazil (both through user groups and Solaris attendance at conferences). I have to say that after experiencing this conference and seeing the enthusiasm that everyone has for exciting technology and open source, I have to agree that Brazil is a great place to focus our OpenSolaris presence. Hopefully we'll see user groups pop up here as well as the rest of the world. We'll be doing everything we can to help from within Sun.</p>
<p>The other, more amusing, highlight of the day was during my DTrace demonstration. I needed an interesting java application to demonstrate the <tt>jstack()</tt> DTrace action, so I started up the only java application (apart from some internal Sun tools) that I use on a regular basis: Yahoo! Sports Fantasy Baseball StatTracker (the classic version, not the new flash one). I tried to explain that maybe I was trying to debug why the app was lying to me about Tejada going 0-2 so far in the Sox/Orioles game; really he should have hit two homers and I should be dominating this week's scores<sup>1</sup>. I was rather amused, but I think the cultural divide was a little too wide. Not only baseball, but <i>fantasy baseball</i>: I don't blame the audience at all.</p>
Technorati tags: <a href="http://technorati.com/tag/Solaris" rel="tag">Solaris</a>
<a href="http://technorati.com/tag/OpenSolaris" rel="tag">OpenSolaris</a>
<hr/>
<p><sup>1</sup> This is clearly a lie. Despite any dreams of fantasy baseball domination, I would never root for my players in a game over the Red Sox. In the end, Ryan's 40.5 ERA was worth the bottom of the ninth comeback capped by Ortiz's 3-run shot.</p>
<h3>Live from Brazil</h3>
<p><i>Tue, 31 May 2005 13:47:07 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/live_from_brazil">permalink</a>)</p>
<p><a href="http://blogs.sun.com/dep">Dave Powell</a> and I have arrived at <a href="http://fisl.softwarelivre.org">FISL</a>, an open source conference in Brazil, along with a crowd of other Sun folks. Dave and I (with an introduction from Sun VP Tom Goguen) will be hosting a 4-hour OpenSolaris pre-event tomorrow, June 1st. We'll be talking about all the cool features available in OpenSolaris, as well as how Solaris development works today and how we hope it will work in the future. If you're attending the conference, be sure to stop by to learn about OpenSolaris, and what makes Solaris (and Solaris developers) tick. We'll also be hanging around the Sun booth during the rest of the conference, giving mini-presentations, demos, answering questions, and chatting with anyone who will listen. We're happy to talk about OpenSolaris, Solaris, Sun, or your favorite scenes from <a href="http://www.imdb.com/title/tt0071853/">Monty Python and the Holy Grail</a>. Oh yeah, there will be lots of T-shirts and Solaris 10 DVDs as well.</p>
<h3>In the News(.com)</h3>
<p><i>Thu, 12 May 2005 15:50:36 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/in_the_news_com">permalink</a>)</p>
<p>So it looks like my blog made it over to the front page of news.com in <a href="http://news.com.com/Major+Solaris+features+slip+to+2006/2100-1016_3-5705288.html?tag=nefd.top">this article</a> about slipping Solaris 10 features. Don't get your hopes up: I'm not going to refute Genn's claims; we certainly are not scheduled for a specific update at the moment. But pay attention to the details: ZFS and Janus will be available in an earlier Solaris Express release. I also find it encouraging that engineers like myself have a voice that actually gets picked up by the regular press (without being <a href="http://www.zdnet.com.au/news/software/0,2000061733,39190328,00.htm">blown out of proportion</a> or <a href="http://linux.slashdot.org/linux/04/09/27/1311221.shtml">slashdotted</a>).</p>
<p>I would like to point out that I putback the last major chunk of command redesign to the ZFS gate yesterday ;-) There are certainly some features left to implement, but the fact that I re-whacked all of the userland components (within six weeks, no less) should not be interpreted as any statement of schedule plans. Hopefully I can get into some of the details of what we're doing but I don't want to be seen as promoting vaporware (even though we have many happy beta customers) or exposing unfinished interfaces which are subject to change.</p>
<p>I also happen to be involved with the ongoing Janus work, but that's another story altogether. I swear there's no connection between myself and slipping products (at least not one where I'm the cause).</p>
<p><b>Update:</b> So much for not getting <a href="http://www.osnews.com/comment.php?news_id=10581">blown out of proportion</a>. Leave it to the second-tier news sites to turn "not scheduled for an update" into "delayed indefinitely over deficiencies". Honestly, rewriting 5% of the code should hardly be interpreted as "delayed indefinitely"; so much for legitimate journalism. Please keep in mind that all features will hit Software Express before an S10 Update, and OpenSolaris even sooner.</p>
<h3>GDB to MDB Migration, Part One</h3>
<p><i>Tue, 10 May 2005 19:13:35 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/gdb_to_mdb">permalink</a>)</p>
<p>In past comments, it has been pointed out that a transition guide between GDB and MDB would be useful to some developers out there. A full comparison would also cover <tt>dbx(1)</tt>, but I'll defer that to a later point. Given the number of available commands, I'll be dividing this post into at least two pieces.</p>
<p>Before diving into too much detail, it should be noted that MDB and GDB have slightly different design goals. MDB (and KMDB) replaced the aging <tt>adb(1)</tt> and <tt>crash(1M)</tt>, and was designed primarily for post-mortem analysis and live kernel analysis. To this end, MDB presents the same interface when debugging a crash dump as when examining a live kernel. Solaris corefiles have been enhanced so that all the information for the process (including library text and type information) is present in the corefile. MDB can examine and run live processes, but lacks some of the features (source level debugging, STABS/DWARF support, conditional breakpoints, scripting language) that are standard for developer-centric tools like GDB (or dbx). GDB was designed for interactive process debugging. While you can use GDB on corefiles (and even <a href="http://lkcd.sourceforge.net/">LKCD</a> crash dumps or Linux kernels - locally and remotely), you often need the original object files to take advantage of GDB's features.</p>
<p>Before going too far into MDB, be sure to check out Jonathan's <a href="http://blogs.sun.com/roller/page/jwadams/20041007#an_mdb_1_cheat_sheet">MDB Cheatsheet</a> as a useful quick reference guide, with some examples of stringing together commands into pipelines. Seeing as how I'm not the most accomplished GDB user in the world, I'll be basing this comparison off the equivalent <a href="http://refcards.com/refcards/gdb/index.html">GDB reference card</a>.</p>
<table border=0 cellpadding=5>
<tr><td width=30/><th width=130 align=left>GDB</th><th width=130 align=left>MDB</th><th align=left>Description</th></tr>
<tr><td colspan=4><h3>Starting Up</h3></td></tr>
<tr><td/><td valign=top><tt>gdb <i>program</i></tt></td><td valign=top><tt>mdb <i>path</i></tt><br/><tt>mdb -p <i>pid</i></tt></td><td valign=top>Start debugging a command or running process. GDB will treat numeric arguments as pids, while mdb explicitly requires the '-p' option</td></tr>
<tr><td/><td valign=top><tt>gdb <i>program</i> <i>core</i></tt></td><td valign=top><tt>mdb <i>[ program ] core</i></tt></td><td valign=top>Debug a corefile associated with 'program'. For MDB, the program is optional and is generally unnecessary given the corefile enhancements made during Solaris 10.</td></tr>
<tr><td colspan=4><h3>Exiting</h3></td></tr>
<tr><td/><td valign=top><tt>quit</tt></td><td valign=top><tt>::quit</tt></td><td valign=top>Both programs also exit on Ctrl-D.</td></tr>
<tr><td colspan=4><h3>Getting Help</h3></td></tr>
<tr><td/><td valign=top><tt>help</tt><br/><tt>help <i>command</i></tt></td><td valign=top><tt>::help</tt><br/><tt>::help <i>dcmd</i></tt><br/><tt>::dcmds</tt><br/><tt>::walkers</tt><br/></td><td>In mdb, you can list all the available walkers or dcmds, as well as get help on a specific dcmd. Another useful trick is <tt>::dmods -l <i>module</i></tt> which lists walkers and dcmds provided by a specific module.</td></tr>
<tr><td colspan=4><h3>Running Programs</h3></td></tr>
<tr><td/><td valign=top><tt>run <i>arglist</i></tt></td><td valign=top><tt>::run <i>arglist</i></tt></td><td valign=top>Runs the program with the given arguments. If the target is currently running, or is a corefile, MDB will restart the program if possible.</td></tr>
<tr><td/><td valign=top><tt>kill</tt></td><td valign=top><tt>::kill</tt></td><td
valign=top>Forcibly kill and release target.</td></tr>
<tr><td/><td valign=top><tt>show env</tt></td><td
valign=top><tt>::getenv</tt></td><td valign=top>Display current environment.</td></tr>
<tr><td/><td valign=top><tt>set env <i>var string</i></tt></td><td valign=top><tt>::setenv <i>var=string</i></tt></td><td>Set an environment variable.</td></tr>
<tr><td/><td valign=top><tt>get env <i>var</i></tt></td><td
valign=top><tt>::getenv <i>var</i></tt></td><td valign=top>Get a specific environment variable.</td></tr>
<tr><td colspan=4><h3>Shell Commands</h3></td></tr>
<tr><td/><td valign=top><tt>shell <i>cmd</i></tt></td><td valign=top><tt>!
<i>cmd</i></tt></td><td valign=top>Execute the given shell command.</td></tr>
<tr><td colspan=4><h3>Breakpoints and Watchpoints</h3></td></tr>
<tr><td/><td valign=top><tt>break <i>func</i></tt><br/><tt>break <i>*addr</i></tt></td><td
valign=top><tt><i>addr</i>::bp</tt></td><td valign=top>Set a breakpoint at the given
address or function.</td></tr>
<tr><td/><td valign=top><tt>break <i>file:line</i></tt></td><td
valign=top><tt>-</tt></td><td valign=top>Break at the given line of the file. MDB does not
support source level debugging.</td></tr>
<tr><td/><td valign=top><tt>break <i>...</i> if <i>expr</i></tt></td><td
valign=top><tt>-</tt></td><td valign=top>Set a conditional breakpoint. MDB doesn't support
conditional breakpoints, though you can get a close approximation via the
<tt>-c</tt> option (though it's complicated enough to warrant its own
post).</td></tr>
<tr><td/><td valign=top><tt>watch <i>expr</i></tt></td><td
valign=top><tt><i>addr</i>::wp <i>-rwx</i> <i>[-L size]</i></tt></td><td
valign=top>
Set a watchpoint on the given region of memory.</td></tr>
<tr><td/><td valign=top><tt>info break</tt><br/><tt>info watch</tt></td><td
valign=top><tt>::events</tt></td><td valign=top>Display active watchpoints
and breakpoints. MDB will show you signal events as well.</td></tr>
<tr><td/><td valign=top><tt>delete [n]</tt></td><td
valign=top><tt>::delete <i>n</i></tt></td><td valign=top>Delete the given breakpoint or
watchpoint.</td></tr>
</table>
<p>I think that's enough for now; hopefully the table is at least readable. More to come in a future post.</p>
<h3>Bug of the week</h3>
<p><i>Sun, 3 Apr 2005 12:40:04 +0000</i> (<a href="https://blogs.oracle.com/eschrock/entry/bug_of_the_week">permalink</a>)</p>
<p>There are many bugs out there that are interesting, either because of an implementation detail or the debugging necessary to root cause the problem. As you may have noticed, I like to publicly expound upon the most interesting ones I've fixed (as long as it's not a security vulnerability). This week turned up a rather interesting specimen:</p>
<p>6198523 dirfindvp() can erroneously return ENOENT</p>
<p>This bug was first spotted by <a href="http://blogs.sun.com/roller/page/casper">Casper</a> back in November last year while trying to do some builds on ZFS. The basic pathology is that at some point during the build, we'd get error messages like:</p>
<p><tt>sh: cannot determine current directory</tt></p>
<p>Some ideas were kicked around by the ZFS team, and after the problem seemed to go away, the team believed that a recent batch of changes had fixed it as well. Five months later, <a href="http://blogs.sun.com/jwadams">Jonathan</a> hit the same bug on another build machine running ZFS. Since I had written the getcwd() code, I was determined to root cause the problem this time around.</p>
<p>Back in build 56 of S10, I moved <tt>getcwd(3c)</tt> into the kernel, along with changes to store pathnames with vnodes (which is used by the <a href="http://www.sun.com/bigadmin/content/dtrace">DTrace</a> I/O provider as well as <tt>pfiles(1)</tt>). Basically, we first try to do a forward lookup on the stored pathname; if that works, then we simply return the resolved path<sup>1</sup>. If this fails (vnode paths are never guaranteed to be correct), then we have to fall back on the slow path: look up the parent directory, find the current vnode in that parent, prepend its name to the path, and repeat. Once we reach the root of the filesystem, we have a complete path.</p>
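<p>The slow path can be sketched from userland with the same algorithm, using <tt>readdir(3c)</tt> and <tt>stat(2)</tt>. This is only an illustration of the technique, not the actual kernel <tt>dirtopath()</tt> code, and it ignores mount-point and permission subtleties:</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;
#include &lt;dirent.h&gt;
#include &lt;sys/stat.h&gt;

#define	MAXLEN	4096

/*
 * Walk up via ".." one level at a time; at each level, scan the parent
 * directory for the entry whose (dev, ino) pair matches the child, and
 * prepend that name to the result.  Stop when ".." refers to the same
 * file as ".", which only happens at the root.
 */
static int
slow_getcwd(char *buf, size_t buflen)
{
	char rel[MAXLEN] = ".";		/* "." with "/.." appended per level */
	char result[MAXLEN] = "";
	struct stat cur, up;

	if (stat(rel, &cur) != 0)
		return (-1);

	for (;;) {
		char uppath[MAXLEN];
		DIR *dirp;
		struct dirent *dp;
		int found = 0;

		(void) snprintf(uppath, sizeof (uppath), "%s/..", rel);
		if (stat(uppath, &up) != 0)
			return (-1);
		if (up.st_dev == cur.st_dev && up.st_ino == cur.st_ino)
			break;		/* reached the root */

		/* the dirfindvp() analogue: find cur by name in its parent */
		if ((dirp = opendir(uppath)) == NULL)
			return (-1);
		while ((dp = readdir(dirp)) != NULL) {
			char entry[MAXLEN * 2];
			struct stat st;

			(void) snprintf(entry, sizeof (entry), "%s/%s",
			    uppath, dp->d_name);
			if (lstat(entry, &st) == 0 &&
			    st.st_dev == cur.st_dev &&
			    st.st_ino == cur.st_ino) {
				char tmp[MAXLEN];
				(void) snprintf(tmp, sizeof (tmp), "/%s%s",
				    dp->d_name, result);
				(void) strcpy(result, tmp);
				found = 1;
				break;
			}
		}
		(void) closedir(dirp);
		if (!found)
			return (-1);	/* entry vanished out from under us */

		(void) strcpy(rel, uppath);
		cur = up;
	}

	(void) snprintf(buf, buflen, "%s", result[0] != '\0' ? result : "/");
	return (0);
}

int
main(void)
{
	char mine[MAXLEN], sys[MAXLEN];

	if (chdir("/tmp") != 0)
		return (1);
	if (slow_getcwd(mine, sizeof (mine)) != 0 ||
	    getcwd(sys, sizeof (sys)) == NULL)
		return (1);
	(void) printf("%s\n", strcmp(mine, sys) == 0 ? "match" : "mismatch");
	return (0);
}
</pre>
<p>The "entry vanished" return is exactly where the race described below lives: a file seen by <tt>readdir()</tt> may be gone by the time we try to look it up.</p>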
<p>To debug this problem, I used <a href="http://blogs.sun.com/roller/resources/eschrock/getcwd.d">this D script</a> to track the behavior of <tt>dirtopath()</tt>, the function that performs the dirty work of the slow path. Running this for a while produced a tasty bit of information:</p>
<p><pre>
dirtopath /export/ws/build/usr/src/cmd/sgs/ld
lookup(/export/ws/build/usr/src/cmd, .make.dependency.8309dfdc.234596.166) failed (2)
dirfindvp(/export/ws/build/usr/src/cmd,/export/ws/build/usr/src/cmd/sgs) failed (2)
dirtopath() returned 2
</pre></p>
<p>Looking at this, it was clear that <tt>dirfindvp()</tt> (which finds a given vnode in its parent) was inappropriately failing. In particular, after a failed lookup for a temporary make file, we bail out of the loop and report failure, despite the fact that "sgs" is still sitting there in the directory. A long look at the code revealed the problem. Without revealing too much of the code (<a href="http://www.opensolaris.org">OpenSolaris</a>, where are you?), it's essentially structured like so:</p>
<p><pre>
while (!err && !eof) {
        /* ... */
        while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {
                /* ... */
                /*
                 * We only want to bail out if there was an error other
                 * than ENOENT. Otherwise, it could be that someone
                 * just removed an entry since the readdir() call, and
                 * the entry we want is further on in the directory.
                 */
                if (err != ENOENT) {
                        break;
                }
        }
}
</pre></p>
<p>The code is trying to avoid exactly our situation: we fail to look up a file we just saw because the contents are rapidly changing. The bug is that the outer while loop has a check for <tt>!err && !eof</tt>. If we fail to look up an entry, and it's the last entry in the chunk we just read, then we prematurely bail out of the enclosing while loop, returning ENOENT when we shouldn't. Using <a href="http://blogs.sun.com/roller/resources/eschrock/dirfindvp.c">this test program</a>, it's easy to reproduce on both ZFS and UFS. There are several noteworthy aspects of this bug:</p>
<ul>
<li><p>The bug had been in the gate for <b>over a year</b>, and there hadn't been a single reported build failure.</p></li>
<li><p>It only happens when the cached vnode value is invalid, which is rare<sup>2</sup>.</p></li>
<li><p>It is a race condition between readdir, lookup, and remove.</p></li>
<li><p>On UFS, inodes are marked as deleted but can still be looked up until the delete queue is processed at a later point. ZFS deletes entries immediately, so this was much more apparent on ZFS.</p></li>
<li><p>Because of the above, it was incredibly transient. It would have taken an order of magnitude more time to root cause if not for <a href="http://www.sun.com/bigadmin/content/dtrace/">DTrace</a>, which excels at catching transient phenomena like this.</p></li>
</ul>
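<p>The control flow of the bug and its fix can be distilled into a small, self-contained simulation. The file names and chunk layout here are hypothetical; only the loop structure mirrors the real code. The temporary file vanishes between <tt>readdir()</tt> and the lookup, and because it happens to be the last entry of its chunk, the buggy loop gives up before ever reaching the entry it wanted:</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;errno.h&gt;

/*
 * The temporary make file is removed between readdir() and lookup(),
 * so looking it up fails with ENOENT.
 */
static int
lookup(const char *name)
{
	return (strcmp(name, ".make.dependency.tmp") == 0 ? ENOENT : 0);
}

/*
 * Walk the directory one chunk at a time looking for "target",
 * mimicking the loop structure of dirfindvp().
 */
static int
find_entry(const char *target, int fixed)
{
	static const char *chunks[2][2] = {
		{ "unix.c", ".make.dependency.tmp" },	/* chunk 1 */
		{ "sgs", NULL },			/* chunk 2 */
	};
	int err = 0, eof = 0, c = 0;

	while (!err && !eof) {
		int i;

		for (i = 0; i < 2 && chunks[c][i] != NULL; i++) {
			err = lookup(chunks[c][i]);
			if (err == 0 && strcmp(chunks[c][i], target) == 0)
				return (0);	/* found it */
			if (err != 0 && err != ENOENT)
				break;		/* a real error: bail out */
			if (fixed)
				err = 0;	/* the fix: ENOENT isn't fatal */
		}
		if (++c == 2)
			eof = 1;
	}
	return (err != 0 ? err : ENOENT);
}

int
main(void)
{
	(void) printf("buggy: %s\n",
	    find_entry("sgs", 0) == ENOENT ? "ENOENT (wrong)" : "found");
	(void) printf("fixed: %s\n",
	    find_entry("sgs", 1) == 0 ? "found (correct)" : "ENOENT");
	return (0);
}
</pre>
<p>In the buggy variant, the lingering ENOENT from the last entry of chunk 1 terminates the outer loop even though "sgs" is sitting in the next chunk; clearing the error for the ENOENT case lets the scan continue.</p>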
<p>A three-line change fixed the bug, and the fix will make it back to S10 in time for Update 1. If it hadn't been for those among us willing to run our builds on top of ZFS, this problem wouldn't have been found until after ZFS integrated, or until a customer escalation cost the company a whole bunch of money.</p>
<hr/>
<p><sup>1</sup> There are many more subtleties here relating to Zones, and verifying that the path hasn't been changed to refer to another file. The curious among you will have to wait for OpenSolaris.</p>
<p><sup>2</sup> I haven't yet investigated why we ended up in the slow path in this case. First things first.</p>https://blogs.oracle.com/eschrock/entry/google_coredumperGoogle coredumpereschrockhttps://blogs.oracle.com/eschrock/entry/google_coredumper
Fri, 18 Mar 2005 08:25:59 +0000OpenSolaris<p>In the last few days you may have noticed that Google released a <a href="http://code.google.com">site</a> filled with Open Source applications and interfaces. First off, kudos to the Google guys for putting this together. It's always great to see a company open sourcing their tools, as well as encouraging open standards to take advantage of their services.</p>
<p>That being said, I found the <a href="https://sourceforge.net/projects/goog-coredumper/">google coredumper</a> particularly amusing. From the google page:</p>
<p><b>coredumper:</b> Gives you the ability to dump cores from programs when it was previously not possible.</p>
<p>Being very close to the debugging tools on Solaris, I was a little taken aback by this statement. On Solaris, the <tt>gcore(1)</tt> command has always been a supported tool for generating standard Solaris core files readable by any debugger. Seeing as how I can't imagine a UNIX system without this tool, I went looking in some old source trees to find out when it was originally written. While the current Solaris version has been re-written over the course of time, I did find this comment buried in the old SunOS 3.5 source:</p>
<pre>
/*
 * gcore - get core images of running processes
 *
 * Author: Eric Cooper
 * Written: Fall 1981.
 *
 * Inspired by a version 6 program by Len Levin, 1978.
 * Several pieces of code lifted from Bill Joy's 4BSD ps.
 */
</pre>
<p>So this tool has been a standard part of UNIX since 1981, and based on sources as old as 1978. This is why the statement that it was "previously not possible" on Linux seemed shocking to me. Just to be sure, I logged into one of our machines running Linux and tried poking around:</p>
<pre>
$ find /usr/bin -name "*core*"
$
</pre>
<p>No luck. Intrigued, I took a look at the google project. From the included README:</p>
<p><i>The coredumper library can be compiled into applications to create
core dumps of the running program, without having to terminate
them. It supports both single- and multi-threaded core dumps, even if
the kernel does not have native support for multi-threaded core files.
</i></p>
<p>So the design goal appears to be slightly different: being able to dump core from within the program itself. On Solaris, I would just fork/exec a copy of gcore(1), or use the (unfortunately private) libproc interface to do so. I find it hard to believe that there are kernels out there without support for multi-threaded core files, though. I did a quick Google search for 'gcore linux' and turned up a few mailing list articles <a href="http://lists.osdl.org/pipermail/cgl_discussion/2003-September/001579.html">here</a>, <a href="http://lists.ssc.com/pipermail/linux-list/1999-May/001466.html">here</a>, and <a href="http://lists.svlug.org/archives/svlug/1999-October/020449.html">here</a>. I went and downloaded the latest GDB sources, and sure enough there is a "gcore" command. I went back to our lab machine and tried it out with gdb 5.1, where generating the core appeared to work. But reading the file back was not as successful:</p>
<pre>
# gdb -p `pgrep nscd`
...
(gdb) info threads
7 Thread 5126 (LWP 1018) 0x420e7fc2 in accept () from /lib/i686/libc.so.6
6 Thread 4101 (LWP 1017) 0x420e7fc2 in accept () from /lib/i686/libc.so.6
5 Thread 3076 (LWP 1016) 0x420e7fc2 in accept () from /lib/i686/libc.so.6
4 Thread 2051 (LWP 1015) 0x420e0037 in poll () from /lib/i686/libc.so.6
3 Thread 1026 (LWP 1014) 0x420e7fc2 in accept () from /lib/i686/libc.so.6
2 Thread 2049 (LWP 1013) 0x420e0037 in poll () from /lib/i686/libc.so.6
1 Thread 1024 (LWP 1007) 0x420e7fc2 in accept () from /lib/i686/libc.so.6
(gdb) bt
#0 0x420e7fc2 in accept () from /lib/i686/libc.so.6
#1 0x40034603 in accept () from /lib/i686/libpthread.so.0
#2 0x0804acd5 in geteuid ()
#3 0x4002ffef in pthread_start_thread () from /lib/i686/libpthread.so.0
(gdb) gcore
Saved corefile core.1014
(gdb) quit
The program is running. Quit anyway (and detach it)? (y or n) y
# gdb core.1014
...
"/tmp/core.1014": not in executable format: File format not recognized
(gdb) quit
# gdb /usr/sbin/nscd core.1014
...
Core was generated by `/usr/sbin/nscd'.
Program terminated with signal 17, Child status changed.
#0 0x420e0037 in poll () from /lib/i686/libc.so.6
(gdb) info threads
7 process 67109878 0x420e7fc2 in accept () from /lib/i686/libc.so.6
6 process 134284278 0x420e0037 in poll () from /lib/i686/libc.so.6
5 process 67240950 0x420e7fc2 in accept () from /lib/i686/libc.so.6
4 process 134415350 0x420e7fc2 in accept () from /lib/i686/libc.so.6
3 process 201589750 0x420e7fc2 in accept () from /lib/i686/libc.so.6
2 process 268764150 0x420e7fc2 in accept () from /lib/i686/libc.so.6
* 1 process 335938550 0x420e0037 in poll () from /lib/i686/libc.so.6
(gdb) bt
#0 0x420e0037 in poll () from /lib/i686/libc.so.6
#1 0x0804aca8 in geteuid ()
#2 0x4002ffef in pthread_start_thread () from /lib/i686/libpthread.so.0
(gdb) quit
#
</pre>
<p>This whole exercise was rather distressing, and brought me straight back to college, when I had to deal with gdb on a regular basis (<a href="http://www.cs.brown.edu">Brown</a> moved to Linux my senior year, and I was responsible, together with Rob, for porting the <a href="http://www.cs.brown.edu/courses/cs169/docs/sim.pdf">Brown Simulator</a> and <a href="http://www.cs.brown.edu/courses/cs169/docs/khg.pdf">Weenix OS</a> from Solaris). Everything seemed fine when first attaching to the process, and the gcore command appeared to work. But when reading back the corefile, gdb couldn't understand the corefile on its own, the process/thread IDs were completely garbled, and I had lost floating point state (not shown above). It makes me glad that we have MDB, and <a href="http://blogs.sun.com/roller/page/ahl/?anchor=number_13_of_20_core">configurable corefile content</a> in Solaris 10.</p>
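<p>The fork/exec approach to self-dumping mentioned earlier can be sketched in a few lines. This is an illustration, not hardened code; it assumes a gcore binary somewhere on <tt>$PATH</tt> and reports whether the dump succeeded:</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;

/*
 * Dump a core of the calling process without terminating it by
 * fork/exec'ing gcore(1) against our own pid.
 */
static int
dump_self(void)
{
	char pidstr[16];
	pid_t child;
	int status;

	(void) snprintf(pidstr, sizeof (pidstr), "%d", (int)getpid());
	if ((child = fork()) == -1)
		return (-1);
	if (child == 0) {
		(void) execlp("gcore", "gcore", pidstr, (char *)NULL);
		_exit(127);		/* exec itself failed */
	}
	if (waitpid(child, &status, 0) == -1)
		return (-1);
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1);
}

int
main(void)
{
	(void) printf("gcore %s\n",
	    dump_self() == 0 ? "succeeded" : "not available here");
	return (0);
}
</pre>
<p>The parent keeps running after the dump; the only cost is the brief stop while gcore grabs the process.</p>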
<p>This is likely an unfair comparison, since it's using GDB version 5.1 when the latest is 6.3, but at least it validates the existence of the google library. I always pay attention to debugging tools around the industry, but it seems like I need to get a little more hands-on experience to really gauge the current state of affairs. I'll have to get access to a system running a more recent version of GDB to see if it is any better before drawing any definitive conclusions. Then again, Solaris has had a working gcore(1) and mdb(1)/adb(1) since the SunOS days back in the 80s, so I don't see why I should have to lower my expectations just because it's GNU/Linux.</p>https://blogs.oracle.com/eschrock/entry/yet_another_interesting_bugAnother interesting bugeschrockhttps://blogs.oracle.com/eschrock/entry/yet_another_interesting_bug
Thu, 17 Mar 2005 21:27:30 +0000OpenSolaris<p>I know it's been a long time since I posted a blog entry. But I've either been too busy, out of the country, working on (not yet public) projects, or fixing relatively uninteresting bugs. But last week I finally nailed down a nasty bug that had been haunting me for several weeks, so I thought I'd share some of the experience. I apologize if this post gets a little too technical and/or incomprehensible. But I found it to be an interesting exercise, and hopefully sharing it will get me back in the blogging spirit.</p>
<p>First a little background. In Solaris, we have a set of kernel functions known as 'copyops' used to transfer data between the kernel and userland. In order to support watchpoints and SunCluster, we maintain a backup vector of functions to be used when one of the default functions fails. For example, if you have a piece of data on a watched page, we keep that page entirely unmapped. If the kernel tries to read data from this page, the <tt>copyin()</tt> function will initially fail before falling back on <tt>watch_copyin()</tt>, which temporarily maps in the page, does the copy (triggering a watchpoint if necessary), and then unmaps the page. In this way, the average kernel consumer has no idea that there was a watched area on the page.</p>
<p>Clustering uses this facility in their pxfs (proxy filesystem) implementation. In order to support ioctl() calls that access an unspecified amount of memory, they use the copyops vector to translate any reads or writes into over-the-wire requests for the necessary data. These requests are always done from kernel threads, with no attached user space, so any attempt to access userland should fault before vectoring off to their copyops vector.</p>
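<p>The copyops mechanism described above can be modeled in a few lines of userland C. This is a toy sketch of the fallback idea, with hypothetical names, not the actual Solaris implementation:</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;errno.h&gt;

/* stand-in for a user page that has a watchpoint on it */
static char watched_page[64] = "data on a watched page";
static int page_mapped = 0;	/* watched pages are kept unmapped */

/* the default copyin: fails if the page isn't mapped */
static int
default_copyin(const void *uaddr, void *kaddr, size_t len)
{
	if (!page_mapped)
		return (EFAULT);
	(void) memcpy(kaddr, uaddr, len);
	return (0);
}

/* the backup routine: map the page, copy, unmap */
static int
watch_copyin(const void *uaddr, void *kaddr, size_t len)
{
	int err;

	page_mapped = 1;	/* temporarily map the page in */
	/* a real implementation would fire the watchpoint here */
	err = default_copyin(uaddr, kaddr, len);
	page_mapped = 0;	/* and unmap it again */
	return (err);
}

/* the per-thread vector of backup functions */
typedef struct copyops {
	int (*cp_copyin)(const void *, void *, size_t);
} copyops_t;

static copyops_t watch_copyops = { watch_copyin };
static copyops_t *t_copyops = &watch_copyops;

/* what the kernel consumer actually calls */
static int
copyin(const void *uaddr, void *kaddr, size_t len)
{
	int err = default_copyin(uaddr, kaddr, len);

	if (err != 0 && t_copyops != NULL)
		err = t_copyops->cp_copyin(uaddr, kaddr, len);
	return (err);
}

int
main(void)
{
	char buf[64];

	if (copyin(watched_page, buf, sizeof (buf)) == 0)
		(void) printf("copyin: %s\n", buf);
	return (0);
}
</pre>
<p>The consumer calling <tt>copyin()</tt> never sees the failure; the backup vector makes the watched (or, in the pxfs case, remote) data appear seamlessly.</p>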
<p>OK, on to the bug. During testing, SunCluster folks found that they were getting essentially random memory corruption during some <tt>ioctl()</tt> calls over pxfs on SPARC machines. After trying in vain to understand the crash dumps, the Clustering folks were able to reproduce the problem on DEBUG bits. In addition to getting traptrace output (a black-box style record of OS traps), the kernel failed an <tt>ASSERT()</tt> deep in the sfmmu HAT (Spitfire Memory Management Unit Hardware Address Translation) layer during a <tt>copyin()</tt> call. This smoking gun pointed straight to the copyops. We expect a kernel thread accessing userland to generate a <tt>T_DATA_EXCEPTION</tt> trap, but instead we were getting a <tt>T_DATA_MMU_MISS</tt> trap, which the HAT was not prepared to handle (nor should it have to).</p>
<p>I spent nearly a week enhancing my copyops test suite, and following several wrong paths deep into SPARC trap tables and the HAT layer. But no amount of testing could reproduce the problem. Finally, I noticed that we had reached the sfmmu assertion as a kernel thread, but our secondary ASI was set to <tt>INVALID_CONTEXT</tt> instead of <tt>KCONTEXT</tt>. On SPARC, all addresses are implicitly tagged with an ASI (address space identifier) that lets us refer to kernel addresses and user addresses without having to share the address space like we do on x86. All kernel threads are supposed to use <tt>KCONTEXT</tt> (0) as their secondary ASI. <tt>INVALID_CONTEXT</tt> (1) is reserved for userland threads in various invalid states. Needless to say, this was confusing.</p>
<p>I knew that somehow we were setting the secondary ASI improperly, or forgetting to set it when we should. I began adding some ASSERTs to a custom kernel and quickly ruled out the former. Finally I booted a kernel with some debug code added to <tt>resume()</tt>, and panicked almost instantly. It was clear that we were coming out of <tt>resume()</tt> as a kernel thread, but with <tt>INVALID_CONTEXT</tt> as our secondary ASI. Many hours of debugging later, I finally found my culprit in <tt>resume_from_zombie()</tt>, which is used when resuming from an exited thread. When a user thread is exiting, we re-parent to p0 (the kernel 'sched' process) and set our secondary ASI to <tt>INVALID_CONTEXT</tt>. If, in <tt>resume()</tt>, we switch from one of these threads to another kernel thread, we see that they both belong to the same process (p0) and don't bother to re-initialize the secondary ASI. We even have a function, <tt>hat_thread_exit()</tt>, designed to do exactly this, only it was a no-op on SPARC. I added a call to <tt>sfmmu_setctx_sec()</tt> to this function, and the problem disappeared. Technically, this has been a bug since the dawn of time, but it had no ill side effects until I changed the way the copyops were used, and SunCluster began testing on S10.</p>
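<p>The buggy interaction can be reduced to a toy model. The names and structure here are a hypothetical simplification of <tt>resume()</tt> and the exit path, just to show how the same-process optimization lets a stale context leak across a switch:</p>
<pre>
#include &lt;stdio.h&gt;

#define	KCONTEXT	0	/* secondary context for kernel threads */
#define	INVALID_CONTEXT	1	/* reserved for exiting user threads */

typedef struct proc { const char *p_name; } proc_t;
typedef struct thread { proc_t *t_procp; } thread_t;

static proc_t p0 = { "sched" };		/* the kernel 'sched' process */
static int sec_ctx;			/* the CPU's secondary context register */

/*
 * The optimization in resume(): if the outgoing and incoming threads
 * belong to the same process, skip reloading the secondary context.
 */
static void
resume(thread_t *cur, thread_t *next, int next_ctx)
{
	if (cur->t_procp != next->t_procp)
		sec_ctx = next_ctx;
}

int
main(void)
{
	thread_t exiting = { &p0 };	/* exited user thread, reparented to p0 */
	thread_t kthread = { &p0 };	/* ordinary kernel thread */

	/* buggy: hat_thread_exit() is a no-op, the stale context survives */
	sec_ctx = INVALID_CONTEXT;	/* set while the user thread exited */
	resume(&exiting, &kthread, KCONTEXT);	/* same proc: reload skipped */
	(void) printf("buggy: kernel thread runs with context %d\n", sec_ctx);

	/* fixed: hat_thread_exit() resets the context at exit time */
	sec_ctx = INVALID_CONTEXT;
	sec_ctx = KCONTEXT;		/* the added sfmmu_setctx_sec() call */
	resume(&exiting, &kthread, KCONTEXT);
	(void) printf("fixed: kernel thread runs with context %d\n", sec_ctx);
	return (0);
}
</pre>
<p>With the fix, the context is already sane by the time <tt>resume()</tt> decides it can skip the reload, so the optimization becomes safe again.</p>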
<p>Besides the sheer amount of debugging effort, this bug was interesting for several reasons:</p>
<ul>
<li>It was impossible to root cause on a non-DEBUG kernel. While we try to make the normal kernel as debuggable as possible, memory corruption (especially due to corruption in the HAT layer) is one of those problems that needs to be caught as close to the source as possible. Solaris has a huge amount of debug code, as well as facilities like traptrace that can only be enabled on a debug kernel due to performance overhead.</li>
<li>The cause of the problem was separated from the symptom by an arbitrary period of time. Once we switched to a kernel thread with a bad ASI, we could harmlessly switch between any number of kernel threads before we find one that actually tries to access userland.</li>
<li>It was completely unobservable in constrained test scenarios. We not only needed to create kernel threads that accessed userland, but we needed to have a userland thread exit and then switch immediately to one of these threads. Needless to say, this is not easy to reproduce, especially when you don't understand exactly what's going on.</li>
<li>This would have been nearly unsolvable on most other OSes. Without a kernel debugger, post mortem crash dump analysis, and tools like DTrace and traptrace records, I doubt I could have ever solved this problem. This is one of those situations where a stack trace and a bunch of <tt>printf()</tt> calls would never have solved the problem.</li>
</ul>
<p>While this wasn't the most difficult problem I've ever had to debug, it certainly ranks up there in recent memory.</p>https://blogs.oracle.com/eschrock/entry/whatthread::whatthread and MDB moduleseschrockhttps://blogs.oracle.com/eschrock/entry/whatthread
Wed, 2 Feb 2005 14:33:41 +0000OpenSolaris<p>A <a href="http://blogs.sun.com/roller/page/eschrock/20040822#kernel_debugging_part_1_kmdb">long time ago</a> I described a debugging problem where it was necessary to determine which threads owned a reader lock. In particular, I used the heuristic that if the address of the rwlock is in a particular thread's stack, then it is most likely held by that thread (and this can be verified by examining the thread's stack). This works 99% of the time, because you typically have the following:</p>
<pre>
rw_enter(lock, RW_READER);
/* ... do something ... */
rw_exit(lock);
</pre>
<p>The compiler has to preserve the address of the lock across all the junk in the middle, so it almost always ends up getting pushed on the stack. As described in the previous post, this meant a combination of <tt>::kgrep</tt> and <tt>::whatis</tt>, plus some hand-pruning, to get the threads in question. At the time, I mentioned how nice it would be to have a dedicated command to do this dirty work. Now that Solaris 10 has shipped, I finally sat down and gave it a try. In a testament to MDB's well-designed interfaces, I was able to write the entire command in under 5 minutes with just 50 lines of code. On top of that, it runs in a fraction of the time: rather than searching the entire address space, we only have to look at the stack for each thread. For example:</p>
<pre>
&gt; c8d45bb6::kgrep | ::whatis
c8d45ae4 is c8d45aa0+44, allocated as a thread structure
cae92ed8 is in thread c8d45aa0's stack
cae92ee4 is in thread c8d45aa0's stack
cae92ef8 is in thread c8d45aa0's stack
cae92f24 is in thread c8d45aa0's stack
&gt; c8d45bb6::whatthread
0xc8d45aa0
&gt;
</pre>
<p>The simple output allows it to be piped to <tt>::findstack</tt> to quickly locate questionable threads. There have been discussions about maintaining a very small set of held reader locks in the thread structure, but it's a difficult problem to solve definitively (without introducing massive performance regressions).</p>
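<p>Stripped of the MDB module plumbing, the heuristic itself is trivial: scan each thread's stack for the lock's address. A toy userland model, with made-up threads and tiny fixed-size stacks standing in for the real walk over kernel thread structures:</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

#define	STACK_WORDS	8

/* toy thread: just a name and a snapshot of its stack words */
typedef struct fake_thread {
	const char *t_name;
	uintptr_t t_stack[STACK_WORDS];
} fake_thread_t;

/* the ::whatthread heuristic: does this value appear on the stack? */
static int
stack_contains(const fake_thread_t *t, uintptr_t value)
{
	int i;

	for (i = 0; i < STACK_WORDS; i++)
		if (t->t_stack[i] == value)
			return (1);
	return (0);
}

int
main(void)
{
	uintptr_t lock = 0xc8d45bb6;
	fake_thread_t threads[] = {
		{ "thread-1", { 0x0, 0xdeadbeef } },
		{ "thread-2", { 0x0, 0x0, 0xc8d45bb6 } },  /* pushed the lock */
		{ "thread-3", { 0x0 } },
	};
	int i;

	for (i = 0; i < 3; i++)
		if (stack_contains(&threads[i], lock))
			(void) printf("%s\n", threads[i].t_name);
	return (0);
}
</pre>
<p>The real dcmd does the same scan over each thread's actual stack bounds, which is why it is so much faster than a <tt>::kgrep</tt> over the whole address space.</p>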
<p>This demonstrates an oft-overlooked benefit of MDB: though few module developers exist outside of the Solaris group, developing MDB modules is extremely simple and powerful (there are more than 500 commands and walkers in MDB today). Over time, I think I've almost managed to suppress all the painful GDB memories from my college years...</p>https://blogs.oracle.com/eschrock/entry/dtrace_and_customer_serviceDTrace and customer serviceeschrockhttps://blogs.oracle.com/eschrock/entry/dtrace_and_customer_service
Tue, 1 Feb 2005 10:57:06 +0000OpenSolaris<p>Today, I thought I'd share a real-world experience that might portray DTrace in a slightly different light than you're used to. The other week, I was helping a customer with the following question:</p>
<p><i>Why is automountd constantly taking up 1.2% of CPU time?</i></p>
<p>The first thought that came to mind was a broken automountd. But if that were the case, you'd be more likely to see it spinning and stealing 100% of the CPU. Just to be safe, I asked the customer to send <tt>truss -u a.out::</tt> output for the automountd process. As expected, I saw automountd chugging away, happily servicing each request as it came in. Automountd was doing nothing wrong - some process was indirectly sending millions of requests a day to the automounter. Taking a brief look at the kernel code, I responded with the following D script:</p>
<pre>
#!/usr/sbin/dtrace -s

auto_lookup_request:entry
{
        @lookups[execname, stringof(args[0]->fi_path)] = count();
}
</pre>
<p>The customer gave it a shot, and found a misbehaving program that was continuously restarting and causing loads of automount activity. Without any further help from me, the customer could easily see exactly which application was the source of the problem, and quickly fixed the misconfiguration.</p>
<p>Afterwards, I reflected on how simple this exchange was, and how difficult it would have been in the pre-Solaris 10 days. Now, I don't expect customers to be able to come up with the above D script on their own (though industrious admins will soon be able to wade through OpenSolaris code). But I was able to resolve their problem in just two emails. I was reminded of the infamous <tt>gtik2_applet2</tt> fiasco described in the DTrace <a href="http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf">USENIX paper</a>: <tt>automountd</tt> was just a symptom of an underlying problem, part of an interaction that was prohibitively difficult to trace to its source. One could turn on <tt>automountd</tt> debug output, but you'd still only see the request itself, not where it came from. To top it off, the offending processes were so short-lived that they never showed up in <tt>prstat(1)</tt> output, hiding from traditional system-wide tools.</p>
<p>After a little thought, I imagined a few Solaris 9 scenarios where I'd either set a kernel breakpoint via <tt>kadb</tt>, or set a user breakpoint in automountd and use <tt>mdb -k</tt> to see which threads were waiting for a response. But these (and all other solutions I came up with) were:</p>
<ul>
<li>Disruptive to the running system</li>
<li>Not guaranteed to isolate the particular problem</li>
<li>Difficult for the customer to understand and execute</li>
</ul>
<p>It really makes me feel the pain our customer support staff must go through now to support Solaris 8 and Solaris 9. DTrace is such a fundamental change in the debugging and observability paradigm that it changes not only the way we kernel engineers work, but also the way people develop applications, administer machines, and support customers. Too bad we can't EOL Solaris 8 and Solaris 9 next week for the benefit of Sun support...</p>