Move the cursor to the end of the text to be cut/copied. (While selecting text, you can perform searches and other advanced movement, a feature that sets vim apart from most other editors.)

Press d (as in "delete") to cut, or y (as in "yank", which I imagine meaning "yank so hard and fast that it leaves a copy behind") to copy.

Move the cursor to the desired paste location.

Press p to paste after the cursor, or P to paste before.

In gvim, visual marking (steps 1-3) can be replaced by selecting text using a mouse or similar pointing device, although I strongly prefer to navigate using the keyboard.

Bonus tip: To replace the selected text with new text (to be entered by you), press 'c' instead of 'd' or 'y' in step 4. This deletes the selection and leaves you in insert mode. Then, instead of (or prior to) steps 5-6, type your replacement text.

Pasting over a block of text

You can copy a block of text by pressing Ctrl-v (or Ctrl-q if you use Ctrl-v for paste), then moving the cursor to select, and pressing y to yank. Now you can move elsewhere and press p to paste the text after the cursor (or P to paste before). The paste inserts a block (which might, for example, be 4 rows by 3 columns of text).

Instead of inserting the block, it is also possible to replace (paste over) the destination. To do this, move to the target location then press 1vp (1v selects an area equal to the original, and p pastes over it).

When a count is used before v, V, or ^V (character, line or block selection), an area equal to the previous area, multiplied by the count, is selected. See the paragraph after :help <LeftRelease>.

Comments

If you just want to copy (yank) the visually marked text, you do not need to 'y'ank it. Marking it will already copy it.

Using a mouse, you can insert it at another position by clicking the middle mouse button.

This also works across Vim applications on Windows systems (the clipboard content is inserted).

This is a really useful thing in Vim. I feel lost without it in any other editor. I have some more points I'd like to add to this tip:

While in (any of the three) Visual mode(s), pressing 'o' will move the cursor to the opposite end of the selection. In Visual Block mode, you can also press 'O', allowing you to position the cursor in any of the four corners.

If you have some yanked text, pressing 'p' or 'P' while in Visual mode will replace the selected text with the already yanked text. (After this, the previously selected text will be yanked.)

Press 'gv' in Normal mode to restore your previous selection.

It's really worth it to check out the register functionality in Vim:

:help registers

If you're still eager to use the mouse-juggling middle-mouse trick of common unix copy-n-paste, or are into bending space and time with i_CTRL-R<reg>, consider checking out ':set paste' and ':set pastetoggle'. (Or in the latter case, try with i_CTRL-R_CTRL-O.)

You can replace a set of text in a visual block very easily by selecting a block, pressing c, and then making changes to the first line. Pressing <Esc> twice replaces all the text of the original selection. See :help v_b_c.

On Windows the <mswin.vim> script seems to be getting sourced for many users.

Result: more Windows-like behavior (Ctrl-v is "paste", instead of visual-block selection). Hunt down your system vimrc and remove sourcing thereof if you don't like that behavior (or substitute <mrswin.vim> in its place; see VimTip63).

With VimTip588 one can sort lines or blocks based on visual-block selection.

Select the inner block to copy using Ctrl-v and highlighting with the hjkl keys.
Yank the visual region (y).
Select the inner block you want to overwrite (Ctrl-v, then highlight with the hjkl keys).
Paste the selection with P (that is, Shift-p); this will overwrite while keeping the block formation.

The "yank" buffers in vim are not the same as the Windows clipboard (i.e., cut-and-paste) buffers. If you're using the yank, it only puts it in a vim buffer - that buffer is not accessible to the Windows paste command. You'll want to use the Edit | Copy and Edit | Paste (or their keyboard equivalents) if you're using the Windows GUI, or select with your mouse and use your X-Windows cut-n-paste mouse buttons if you're running UNIX.

There are some caveats regarding how the "*y (copy into System Clipboard) command works. We have to be sure that we are using vim-full (sudo aptitude install vim-full on debian-based systems) or a vim that has X11 support enabled. Only then will the "*y command work.

For our convenience as we are all familiar with using Ctrl+c to copy a block of text in most other GUI applications, we can also map Ctrl+c to "*y so that in Vim Visual Mode, we can simply Ctrl+c to copy the block of text we want into our system buffer. To do that, we simply add this line in our .vimrc file:

vmap <C-c> "*y

Restart Vim (or re-source our .vimrc) and we are good. Now whenever we are in Visual Mode, we can Ctrl+c to grab what we want and paste it into another application or another editor in a convenient and intuitive manner.

Pasting text into a terminal running Vim with automatic indentation enabled can destroy the indentation of the pasted text. This tip shows how to avoid the problem.
See How to stop auto indenting for automatic indentation issues while you are typing.

If you use Vim commands to paste text, nothing unexpected occurs. The problem only arises when pasting from another application, and only when you are not using a GUI version of Vim.
In a console or terminal version of Vim, there is no standard procedure to paste text from another application. Instead, the terminal may emulate pasting by inserting text into the keyboard buffer, so Vim thinks the text has been typed by the user. After each line ending, Vim may move the cursor so the next line starts with the same indent as the last. However, that will change the indentation already in the pasted text.

Paste toggle

Put the following in your vimrc (change the <F2> to whatever key you want):

set pastetoggle=<F2>

To paste from another application:

Press <F2> (toggles the 'paste' option on).

Use your terminal to paste text from the clipboard (for example, with the Shift-Insert key).

Press <F2> again (toggles the 'paste' option off).

Then the existing indentation of the pasted text will be retained.
If you have a mapping for <F2>, that mapping will apply (and the 'pastetoggle' function will not operate).
Some people like the visual feedback shown in the status line by the following alternative for your vimrc:

nnoremap <F2> :set invpaste paste?<CR>
imap <F2> <C-O>:set invpaste paste?<CR>
set pastetoggle=<F2>

The first line sets a mapping so that pressing <F2> in normal mode will invert the 'paste' option, and will then show the value of that option. The second line does the same in insert mode (but insert mode mappings only apply when 'paste' is off). The third line allows you to press <F2> when in insert mode, to turn 'paste' off.

References

Computerworld - Data stored on disk is made up of long strings (called tracks and sectors) of ones and zeroes. Disk heads read these strings one bit at a time until the drive accumulates the desired quantity of data and then sends it to the processor, memory or other storage devices. How the drive sends that data affects overall performance.

Years ago, all data sent to and from disks traveled in serial form—one bit was sent right after another, using just a single channel or wire.

With integrated circuits, however, it became feasible and cheap to put multiple devices on a single piece of silicon, and the parallel interface was born. Typically, it used eight channels for transmission, allowing eight bits (one byte) to be sent simultaneously, which was faster than straight serial connections. The standard parallel interface used a bulky and expensive 36-wire cable.

So why are vendors dropping parallel interfaces in favor of serial ones, when we need to get data to and from disks faster than ever?

For example, most printers don't even come with parallel ports anymore. Laptops have dropped traditional parallel and serial ports in favor of higher-speed Universal Serial Bus and IEEE 1394 ports. [See QuickLink 29332 for more about these technologies.] We now see this same migration in the interfaces that connect disk drives.

At first glance, this seems counterintuitive. Isn't parallel more efficient than serial, with more capacity? Not really, and certainly not anymore. At current speeds, parallel transmission has several disadvantages.

Processing Overhead

First, remember that data is stored and retrieved one track at a time, one bit at a time. We talk about bytes for convenience, but a byte is just a line of eight bits in a row, and ultimately, we have to process each bit separately.

Thus, before we can send a byte in parallel to a disk drive, we have to get those eight bits and line them up, funneling each to a different wire. When we've done all the processing and moving to get them all ready, we fire off that byte.

At the other end of the cable, when the drive receives the bits, it must go through the reverse process to convert that byte back into a serial bit stream so the disk drive write heads can write it to the disk.
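The fan-out and reassembly described above can be illustrated with a short sketch (the function names here are invented for this example, not part of the article):

```python
# Hypothetical illustration: fan a byte out into the eight bits a
# parallel interface would place on eight separate wires, then
# collect them back into a byte on the receiving end.

def to_wires(byte):
    """Split a byte into 8 bits, most significant bit first."""
    return [(byte >> i) & 1 for i in range(7, -1, -1)]

def from_wires(bits):
    """Reassemble 8 bits back into a single byte."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

bits = to_wires(0x63)            # the ASCII code for 'c'
print(bits)                      # [0, 1, 1, 0, 0, 0, 1, 1]
print(from_wires(bits) == 0x63)  # True
```

Every byte sent in parallel pays for this split on one end and the reassembly on the other.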

To visualize this another way, think about what's almost precisely the reverse process—converting parallel to serial for transmission and back again. This is what happens in sending Morse code over a telegraph line. The message starts out as written words (think parallel) on a sheet of paper. A processor (i.e., the operator's brain) has to convert each letter into a series of dots and dashes (serial) and then send these over the wire.

At the receiving end, another processor has to listen to these serial dots and dashes, then convert them back into letters and words. A lot of overhead is required because the transmission medium doesn't match the original input or desired output.

Signal Skewing

As a signal travels over a wire or an integrated circuit trace, imperfections in the wires or integrated circuit-pad drivers can slow down some bits.

In a parallel connection, the eight bits that leave at the same time don't arrive at the other end at the same time; some will get there later than others. This is called skew. To deal with this, the receiving end has to synchronize itself with the transmitter and must wait until all bits have arrived. The sequence of processing is this: read, wait, latch, wait for clock signal, transmit.

The more wires there are and the longer the distance they span, the greater the skew and the higher the delay. This delay limits the effective clock rate as well as the length and number of parallel lines that are feasible to use.

Crosstalk

The fact that parallel wires are physically bundled means that one signal can sometimes "imprint" itself on the wire next to it. As long as the signals are distinct, this doesn't cause problems.

But as bits get closer together, signal strength attenuates over distance (especially at higher frequencies), and spurious reflections accumulate because of intermediate connectors. As a result, the possibility for error grows significantly, and the disk controller may not be able to differentiate between a one and a zero. Extra processing is needed to prevent that.

Serial buses avoid this by modifying signals at the time of transmittal to compensate for such loss. In a serial topology, all the transmission paths are well controlled with minimum variability, which allows serial transmission to run reliably at significantly higher frequencies than parallel designs.

Sunday, November 29, 2009

vim 7
Ben has installed a testing version of vim. To use it, just type vim7 instead of vim (or gvim7 instead of gvim). One nice thing about this new version is spell checking. Below are some of Ben's notes on how to use spell checking with vim7.

add this to your .vimrc:
if has("spell")
" turn spelling on by default
set spell

" they were using white on white
highlight PmenuSel ctermfg=black ctermbg=lightgray

" limit it to just the top 10 items
set sps=best,10
endif
to have a personal wordlist, make a directory called ~/.vim/spell
you can manually add things to your personal wordlist (~/.vim/spell/en.latin1.add):
printf( (so printf is invalid, but printf( is ok)
nextLine()
ArrayList/= (the /= means always match case)
focussed/! (the /! says treat this as a misspelling)
if you manually add to your wordlist, you need to regenerate it:
:mkspell! ~/.vim/spell/en.latin1.add
some useful keys for spellchecking:
]s - forward to misspelled/rare/wrong cap word
[s - backwards

spelling
This is a good place to learn about plugins. Go to www.vim.org, click on the Search link, search the scripts for spelling, and then download the vimspell.vim plugin. To install it, put it in your ~/.vim/plugin directory. Then vim a file and type :help vimspell.

The page on word completion also shows how to load a dictionary. You probably want to put the :set dictionary=/usr/share/dict/words in your .vimrc file.

mapping
Use mappings to save typing for things you frequently type. The first one below, when typing in insert mode, changes every occurrence of ;so to System.out.println(); and leaves you in insert mode between the parentheses!

imap ;so System.out.println();<Left><Left>
imap ;ne <Esc>/;<CR>a
vmap ;bo "zdi<b><C-R>z</b><Esc>
The second one above, while in insert mode, moves you to the end of the next line when you type ;ne. The last one puts bold html tags around something you have visually selected.

These usually go in your .vimrc file. You can even have certain mappings loaded based on the type of file you are editing.

tags
Using tags makes it easier to jump to certain parts of your programs. First run ctags from the UNIX command line on your source files (e.g., ctags prog.c or ctags -R to recurse) to generate a "tags" file, then use these while editing your source files:

:tag TAB - list the known tags
:tag function_name - jump to that function
ctrl-t - goes to previous spot where you called :tag
ctrl-] - calls :tag on the word under the cursor
:ptag - open tag in preview window (also ctrl-w })
:pclose - close preview window

markers
Use markers to set places you want to quickly get back to, or to specify a block of text you want to copy or cut.
mk - mark current position (can use a-z)
'k - move to mark k
d'k - delete from current position to mark k
'a-z - same file
'A-Z - between files

text selection
If you want to do the same thing to a collection of lines, like cut, copy, sort, or format, you first need to select the text. Get out of insert mode, hit one of the options below, and then move up or down a few lines. You should see the selected text highlighted.

v - select characters
V - select whole lines
ctrl-v - select a rectangular block

Ascii vs. Binary Files
Introduction
Most people classify files in two categories: binary files and ASCII (text) files. You've actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an ASCII file.
An ASCII file is defined as a file that consists of ASCII characters. It's usually created by using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for writing code, but they may not always save it as ASCII.

As an aside, ASCII text files seem very "American-centric". After all, the 'A' in ASCII stands for American. However, the US does seem to dominate the software market, and so effectively, it's an international standard.

Computer science is all about creating good abstractions. Sometimes it succeeds and sometimes it doesn't. Good abstractions are all about presenting a view of the world that the user can use. One of the most successful abstractions is the text editor.

When you're writing a program, and typing in comments, it's hard to imagine that this information is not being stored as characters. Of course, if someone really said "Come on, you don't really think those characters are saved as characters, do you? Don't you know about the ASCII code?", then you'd grudgingly agree that ASCII/text files are really stored as 0's and 1's.

But it's tough to think that way. ASCII files are really stored as 1's and 0's. But what does it mean to say that it's stored as 1's and 0's? Files are stored on disks, and disks have some way to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction. Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think of them that way.

The Difference between ASCII and Binary Files?
An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.
However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.
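This claim is easy to check. A minimal sketch (the helper name is made up for the example) that tests whether every byte of some data has its most significant bit clear:

```python
# In a pure ASCII file every byte is below 128, i.e. the most
# significant bit of each byte is 0.

def is_ascii(data: bytes) -> bool:
    return all(byte < 128 for byte in data)

print(is_ascii(b"plain text\n"))      # True
print(is_ascii(bytes([0x63, 0xFF])))  # False: 0xFF uses the high bit
```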

Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.

A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.

We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.

Example of ASCII files
Suppose you're editing a text file with a text editor. Because you're using a text editor, you're pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters 'c', then 'a', then 't'. Then, you save the file and quit.
What happens? For the time being, we won't worry about the mechanism of what it means to open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.

If you look up an ASCII table, you will discover that the ASCII codes for 'c', 'a', and 't' are 0x63, 0x61, and 0x74 (the 0x merely indicates the values are in hexadecimal, instead of decimal/base 10).

Here's how it looks:

ASCII   'c'        'a'        't'
Hex     63         61         74
Binary  0110 0011  0110 0001  0111 0100

Each time you type in an ASCII character and save it, an entire byte is written which corresponds to that character. This includes punctuation, spaces, and so forth. I recall one time a student had used 100 asterisks in his comments, and these asterisks appeared everywhere. Each asterisk used up one byte in the file. We saved thousands of bytes in his files by removing most of the asterisks, which had made the file look nice but didn't add to the clarity.

Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
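A small sketch (the file name is chosen arbitrarily) reproduces this: write "cat" to a file and inspect the raw bytes that end up on disk:

```python
# Write "cat" to a file, then read it back in binary mode to see
# the exact bytes stored: one byte per ASCII character.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cat.txt")
with open(path, "w", newline="") as f:
    f.write("cat")

with open(path, "rb") as f:
    data = f.read()

print([hex(b) for b in data])            # ['0x63', '0x61', '0x74']
print([format(b, "08b") for b in data])  # ['01100011', '01100001', '01110100']
```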

Now sometimes a text editor throws in characters you may not expect. For example, some editors "insist" that each line end with a newline character.

What does that mean? I was once asked by a student what happens if the end of a line does not have a newline character. This student thought that files were stored in two dimensions (whether the student realized it or not). He didn't know that a file is saved as a one-dimensional array. He didn't realize that the newline character defines the end of a line. Without that newline character, you haven't reached the end of the line.

The only place a file can be missing a newline at the end of the line is the very last line. Some editors allow the very last line to end in something besides a newline character. Some editors add a newline at the end of every file.

Unfortunately, even the newline character is not that universally standard. It's common to use newline characters on UNIX files, but in Windows, it's common to use two characters to end each line (carriage return, newline, which is \r and \n, I believe). Why two characters when only one is necessary?

This dates back to printers. In the old days, the time it took for a printer to return back to the beginning of a line was equal to the time it took to type two characters. So, two characters were placed in the file to give the printer time to move the printer ball back to the beginning of the line.

This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case you've wondered why transferring files to UNIX from Windows sometimes generates funny characters.
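As a minimal sketch (the sample text is invented for the example), the two line-ending conventions differ by exactly one byte per line, and a simple replace converts between them:

```python
# Windows lines end in the two bytes \r\n; UNIX lines end in \n.
# Working in bytes makes the difference visible.
windows_text = b"line one\r\nline two\r\n"
unix_text = windows_text.replace(b"\r\n", b"\n")

print(unix_text)                           # b'line one\nline two\n'
print(len(windows_text) - len(unix_text))  # 2: one byte saved per line
```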

Editing Binary Files
Now that you know that each character typed in an ASCII file corresponds to one byte in a file, you might understand why it's difficult to edit a binary file.
If you want to edit a binary file, you really would like to edit individual bits. For example, suppose you want to write the binary pattern 1100 0011. How would you do this?

You might be naive, and type in the following in a file:

11000011

But you should know, by now, that this is not editing individual bits of a file. If you type in '1' and '0', you are really entering 0x31 and 0x30. That is, you're entering 0011 0001 and 0011 0000 into the file. You're actually (indirectly) typing 8 bits at a time.
"But, how am I suppose to edit binary files?", you exclaim! Sometimes I see this dilemma. Students are told to perform a task. They try to do the task, and even though their solution makes no sense at all, they still do it. If asked to think about whether this solution really works, they might eventually reason that it's wrong, but then they'd ask "But how do I edit a binary file? How do I edit the individual bits?"

The answer is not simple. There are some programs that allow you to type in 49, and it translates this to a single byte, 0100 1001, instead of the ASCII codes for '4' and '9'. You can call these programs hex editors. Unfortunately, these may not be so readily available. It's not too hard to write a program that reads in an ASCII file that looks like hex pairs, but then converts it to a true binary file with the corresponding bit patterns.

That is, it takes a file that looks like:

63 a0 de

and converts this ASCII file to a binary file that begins 0110 0011 (the bit pattern for hex 63). Notice that the input file is ASCII, which means what's really stored is the ASCII code for '6', '3', ' ' (space), 'a', '0', and so forth. A program can read this ASCII file and then generate the appropriate binary code and write that to a file.

Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the output binary file would contain 3 bytes, one byte per hex pair.
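A minimal sketch of such a converter (the function name is made up; the original article gives no code):

```python
# Convert an ASCII file of hex pairs, like "63 a0 de", into the
# corresponding raw bytes: one output byte per two-character pair.

def hex_pairs_to_binary(text: str) -> bytes:
    return bytes(int(pair, 16) for pair in text.split())

data = hex_pairs_to_binary("63 a0 de")
print(len(data))                 # 3 bytes out, versus 8 bytes of ASCII in
print(format(data[0], "08b"))    # 01100011, as described
```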

Viewing Binary Files
Most operating systems come with some program that allows you to view a file in "binary" format. However, reading 0's and 1's can be cumbersome, so such programs usually translate to hexadecimal. On Linux distributions, hexdump and xxd are common examples.
While most people prefer to view files through a text editor, you can only conveniently view ASCII files this way. Most text editors will let you look at a binary file (such as an executable), but they insert things that look like ^@ to indicate control characters.

A good hexdump will attempt to translate the hex pairs to printable ASCII if it can. This is interesting because you discover that in, say, executables, many parts of the file are still written in ASCII. So this is a very useful feature to have.

Writing Binary Files, Part 2
Why do people use binary files anyway? One reason is compactness. For example, suppose you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters (which is 6 bytes). However, if you represent it as unsigned binary, you can write it out using 4 bytes.
ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space. You can represent information more compactly by using binary files.

For example, one thing you can do is to save an object to a file. This is a kind of serialization. To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object and the number of bytes used to represent the object (use the sizeof operator to determine this) to the write() method. The method then dumps the bytes as they appear in memory into a file.

You can then recover the information from the file and place it into the object by using a corresponding read() method which typically takes a pointer to an object (and it should point to an object that has memory allocated, whether it be statically or dynamically allocated) and the number of bytes for the object, and copies the bytes from the file into the object.

Of course, you must be careful. If you use two different compilers, or transfer the file from one kind of machine to another, this process may not work. In particular, the object may be laid out differently. This can be as simple as endianness, or there may be issues with padding.

This way of saving objects to a file is nice and simple, but it may not be all that portable. Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will write out the addresses to the file. Those addresses are likely to be totally meaningless. Addresses may make sense at the time a program is running, but if you quit and restart, those addresses may change.

This is why some people invent their own format for storing objects: to increase portability.

But if you know you aren't storing objects that contain pointers, and you are reading the file in on the same kind of computer system you wrote it on, and you're using the same compiler, it should work.

This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire objects. They tend to be somewhat more portable.
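The "write out plain values in a fixed, explicit format" idea can be sketched with Python's struct module. This is an analogy to the C write()/read() calls discussed above, not the article's own code:

```python
# An explicit format string pins down size and byte order, so the
# file means the same thing on any machine that reads it back.
import struct

FMT = "<ii"          # two 4-byte ints, little-endian, no padding

packed = struct.pack(FMT, 100000, -7)
print(len(packed))   # 8 bytes, versus the 9 characters of "100000 -7"

x, y = struct.unpack(FMT, packed)
print(x, y)          # 100000 -7
```

Pinning the endianness in the format string is exactly the kind of detail that raw memory dumps of whole objects leave to chance.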

Summary
An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to 0. Think of an ASCII file as a special kind of binary file.
A generic binary file uses all 8-bits. Each byte of a binary file can have the full 256 bitstring patterns (as opposed to an ASCII file which only has 128 bitstring patterns).

There may be a time when Unicode text files become more prevalent. But for now, ASCII files are the standard format for text files.

Static global variables are declared as "static" at the top level of a source file. Such variables are not visible outside the source file ("file scope"), unlike variables declared as "extern".

Static local variables are declared inside a function, just like automatic local variables. They have the same scope as normal local variables, differing only in "storage duration": whatever values the function puts into static local variables during one call will still be present when the function is called again.
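Python has no static locals, but the "value survives between calls" behavior can be loosely imitated with a function attribute. This is only an analogy to the C storage-duration idea, not C semantics:

```python
# Loose analogy to a C static local: state that persists across
# calls without being an ordinary module-level global.

def counter():
    counter.count = getattr(counter, "count", 0) + 1
    return counter.count

print(counter())  # 1
print(counter())  # 2 -- the value persisted between calls
```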

Thursday, November 26, 2009

Compilation can involve up to four stages: preprocessing, compilation proper, assembly and linking, always in that order. GCC is capable of preprocessing and compiling several files either into several assembler input files, or into one assembler input file; then each assembler input file produces an object file, and linking combines all the object files (those newly compiled, and those specified as input) into an executable file.

Every major new version of the operating system has a learning curve. There's just no way around it. Services advance, hopefully, and support files need to be changed accordingly.

The problem is that with the operating system changing so fast it sometimes takes a while for the documentation to catch up. This leaves you, the admin, holding the bag when adjusting your configurations with little help from the outside world.

Have no fear, ktrace is here.

Ktrace allows you to trace events going through the kernel. Using this incredibly powerful utility you can get an easy look into what's going on in the recesses of your system.

Let me use a real example to illustrate how incredibly useful ktrace is.

Installations that care about security tend to dislike guest access to file servers regardless of what form it takes. This is easy enough to do on a server, but how do you do this with OS X client?

It was fairly well known that on 10.2 all of this information was kept inside NetInfo. With a little bit of digging you would find an entry under the config directory for the AppleFileServer that would allow you to turn off any form of guest access. Bye-bye drop boxes, hello security!

So, you added this bit flip into your standard client image and felt pretty smug about yourself. Problem is when 10.3 came around this all went out the window and you were back to square one. Sure it may be common knowledge now where the configuration is, but walk with me here a bit.

You take stock of the situation. You know the file server is the AppleFileServer process and that hasn't changed. You've scoured NetInfo and found nothing of use. You know that there is most likely a flat file on the system that is used for this, but where? You could grep your entire hard drive, but geez, that's not very elegant, is it?

Instead with ktrace and two minutes of time you'll have your answer.

Ktrace will launch a process while telling the kernel to keep track of what the process does. Ktrace takes this information and saves it to a dump file. You then can use kdump to turn that dump file into plain English where knowledge will ensue.

First make sure that the AppleFileServer process is not running. Can't have two servers running at the same time now, can you?

Next use ktrace to launch the server.

sudo ktrace AppleFileServer

Give it a minute or two to log some good information, then kill off the server and use ktrace to turn off all kernel traces so you don't waste your processor.

sudo killall AppleFileServer
sudo ktrace -C

If you now look in the directory that you were in when you ran the commands, you should see a ktrace.out file. This is the raw dump of the kernel information, not readable by humans. You'll need to use the kdump command to convert this to a readable form. For extra credit, use it with the 'open' command to pull the result up in TextEdit:

sudo kdump > ktrace.txt
open -e ktrace.txt

TextEdit should now be open with a file many pages in length. A quick scan of it should immediately show you what you are looking for. However, we'll be a bit more methodical about it. Do a search on 'open' and you'll find all of the files that the process wanted to open. Some didn't exist and some did. You could also guess that Apple would put the configuration into a property list file, like every other config file they have. So you could search on "plist" too. Either way, after a tiny bit of digging you should see this:

/Library/Preferences/com.apple.AppleFileServer.plist

A quick trip to the Finder will show you that the file exists and can be opened, like any plist file, by a text editor or the Property List Editor application.

But wait, the fun doesn't stop there! Not only does ktrace show you the file that was accessed, it also logs what was read from it. Here you'll find the fleck of gold you're searching for.

guestAccess

All you need to do to block guest access is to switch "true" to "false" in the /Library/Preferences/com.apple.AppleFileServer.plist and your problem is solved with plenty of time left for a long lunch.

Ktrace might not be a full blown Swiss Army knife, but it's at least one of the little ones that you can use as a keychain. It has a tendency to be a bit verbose, but that's a good thing for you. Plus it's got some very cool options.

For example, I was trying to get a feel for what DirectoryService was doing on a server that was slowing down without reason. You can try killing off DirectoryService and then launching it with ktrace, but by the time you get there the kernel has already restarted it, since it is one of the new mach_init processes that get launched on demand.

I was about to get annoyed until I read the man page for ktrace. In addition to using it to launch a process you can have it attach to an already existing process and trace either the current children of that process or any new process that the parent process will spawn.

A quick peek at the beautiful looking Activity Monitor showed that the parent process of DirectoryService was mach_init, which makes a whole lot of sense. So instead of launching DirectoryService with ktrace, I instead attached ktrace to the mach_init process. It has the id of 2 and I told ktrace to monitor any newly spawned children of the parent process. Then I killed off DirectoryService and waited for it to restart.

sudo ktrace -p 2 -i
sudo killall DirectoryService

Give it a moment to reappear in the process list, then turn off the logging and convert it to text.
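The whole attach-and-dump cycle, then, looks something like this (a sketch: ktrace writes its binary log to ktrace.out in the current directory, and -C turns tracing off again):

```sh
# Attach to mach_init (PID 2) and trace newly spawned children (-i).
sudo ktrace -p 2 -i

# Kill DirectoryService so mach_init relaunches it under the trace.
sudo killall DirectoryService

# ...wait for it to reappear in the process list, then stop all tracing.
sudo ktrace -C

# Convert the binary ktrace.out into readable text.
kdump -f ktrace.out > trace.txt
```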

Recently I was asked to implement a content-filter proxy using open source tools. It is a very simple task, but my client asked for a solution with HA (high availability) on Linux.

Well, Linux is a great operating system, but I had never built a Linux HA solution before, so I started to look for information about the options I had.

My first search was for Linux clusters, and I found them very complicated to build and manage. I needed my solution to be simple, fast, secure, and easy to manage and implement.

During my research, I found some references to CARP (Common Address Redundancy Protocol), a very smart and simple solution from the great folks at the OpenBSD project.

Put very simply, CARP lets 2 or more network interfaces share the same IP address. When the MASTER interface goes down, a BACKUP interface automagically takes its place.

It was not hard to find that CARP has been ported to FreeBSD and NetBSD, and that with UCARP (userland CARP) it is possible to use it under Linux too.

I like OpenBSD, but I am more comfortable working with FreeBSD, so that was my first choice.

Now you may be asking, what about Linux? Why not use it?

In my opinion FreeBSD (and the other BSDs as well) is better documented; in particular, the FreeBSD Handbook has almost everything you need to implement a good server. With Linux you have to look at various sites and search a lot to find help. Maybe I am wrong, but that has been my experience so far.

Implementation

I will not cover FreeBSD installation here. Look at the FreeBSD Handbook for more information. It is some of the best documentation out there for an open source OS.

On FreeBSD, CARP must be compiled into the kernel before using it. Building a new kernel is simple and takes only a few minutes on current machines.

Those steps compile CARP into the FreeBSD kernel and reboot the system so the changes take effect.
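The steps themselves are short. A sketch, assuming a custom kernel config named CARP derived from GENERIC and a standard /usr/src source tree:

```sh
# In the kernel configuration file, add:
#   device carp

cd /usr/src
make buildkernel KERNCONF=CARP
make installkernel KERNCONF=CARP
shutdown -r now
```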

To understand CARP, let's look at the options first.

- Preemption: When your hosts are configured with CARP they can use preemption. It makes it possible for one host, the one with the lowest advskew, to always be authoritative for the address on the carp interface. If you don’t use preemption, one of the backup machines may become MASTER (when the original master fails), but when the original master is online again it will remain a BACKUP. With preemption, the original MASTER recovers its status as soon as it comes up again.

- vhid: This is a number identifying the group of the carp interface. You may have various carp groups to build very elaborate and complex HA solutions; here we will use only one group. More than one group may be used to build an ARP-balancing solution (not covered here).

- pass: This is a passphrase used to authenticate the hosts in your carp group. Use the same passphrase on all hosts of the same carp group.

- advskew: This number controls the frequency at which the master sends advertisements to the other hosts. The host with the lowest number will be MASTER. You can build a hierarchy of hosts with this, determining the order to be used in case of failures. The higher this number, the less frequently advertisements are sent, so a host will act as a backup if there are others with a lower advskew.

These are the options used here. There are other options, which can be reviewed by reading the manual pages for carp.

Our configuration is very simple: 2 hosts sharing 1 IP address. When the master goes down, the backup takes over the IP; when the master comes back, it must become master again. I don’t want load balancing here; I just want to keep my services up in case of failures.

Notice: The 192.168.200 interfaces are connected directly to provide a faster route between the hosts, so I can use them to sync files and share resources without loading the real network. These interfaces are not used in the CARP configuration, which works only through the other interfaces.
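For reference, a sketch of what the rc.conf side of such a setup could look like (the shared address 10.1.1.10 matches the ping test described next; the passphrase and netmask are assumptions, and the backup host would use the same lines with a higher advskew):

```sh
# /etc/rc.conf on the intended MASTER (lowest advskew wins):
cloned_interfaces="carp0"
ifconfig_carp0="vhid 1 pass mysecret advskew 0 10.1.1.10/24"

# In /etc/sysctl.conf, let the original master reclaim the address:
#   net.inet.carp.preempt=1
```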

That’s it. Just reboot and the whole configuration will be up. To test it, ping the 10.1.1.10 address from a remote machine and shut down the MASTER. You will see that the ping does not stop.

If you run an ifconfig, it will be possible to see the interfaces running:

Now you can configure your services to listen on the carp0 interface. An HTTP, FTP or proxy server can listen on this interface, and when the master goes down the backup will be up and running. You will only need to set up some synchronization for your files; for this you can use rsync over ssh in a cron job, or even shared remote storage to avoid wasting space. The best solution is the one that fits your needs, so be creative and use this to increase your availability.
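As a sketch of the file-synchronization idea (hostnames and paths are made up; assumes passwordless ssh keys between the hosts):

```sh
# /etc/crontab on the master: push content to the backup host every
# 5 minutes over the direct 192.168.200 link mentioned earlier.
*/5 * * * * root rsync -az --delete /usr/local/www/ 192.168.200.2:/usr/local/www/
```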

CARP is a very simple solution to a very common problem. The configuration is very easy to build and understand and it can be used to build different kinds of HA services.

This article describes a concept for implementing realtime file system replication on a dual-node FreeBSD cluster to provide real HA services. Maybe you are already familiar with DRBD (Distributed Replicated Block Device) from the Linux world, which basically does something we could call network RAID1.

Since DRBD does not run on FreeBSD, one might be tempted to believe that realtime file system replication is not possible at all. This is not true, however. FreeBSD provides you with two valuable geom classes that allow you to implement a very similar setup: ggate and gmirror.

Requirements

The absolute minimum requirements for this setup are as follows:

two hardware nodes running FreeBSD

ethernet connection between both nodes

a free (as in “unused”) disk slice on each node

All right, this is just good enough to get it going. If you are serious about using it, you may want to stick to something better than that:

Don’t use the same ethernet connection for public access AND replication; use a dedicated interface instead, preferably Gigabit ethernet. We’re talking about data replication over a LAN here, so latency and network load are a concern after all.

For the same reasons as above you should not do any geographic separation, especially not over slow links or VPN. Stay within the same network segment.

Use identical hardware for both nodes.

Use identical disk partition and slice setup on both nodes.

Use fast disks and fast disk controllers with good IO performance.

Refrain from using geom/ataraid or other software RAID on partitions/slices mirrored to the second node. Use a real hardware RAID controller instead. If you don’t, deadlocks may occur.

Keep the partitions to be mirrored as small as possible. The reason is that a complete resync is required if the mirror breaks. While a 20 GB partition might synchronize within ~30 minutes across a 100 Mbit network, a 500 GB partition will take over 11 hours.
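Those figures are easy to sanity-check: a 100 Mbit link moves at most 100/8 = 12.5 MB per second, ignoring protocol overhead. A small helper using that back-of-the-envelope formula (the formula itself is the assumption here):

```shell
# Estimate full-resync time in minutes: size_gb * 8000 Mbit / link_mbit / 60.
resync_minutes() {
    awk -v gb="$1" -v mbit="$2" 'BEGIN { printf "%.0f\n", gb * 8000 / mbit / 60 }'
}

resync_minutes 20 100    # about 27 minutes for 20 GB at 100 Mbit/s
resync_minutes 500 100   # about 667 minutes (~11 hours) for 500 GB
```

Real resyncs come out a little slower once protocol overhead and disk IO contention are added.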

You should probably not export more than one disk slice to a remote node. Every request (especially with lots and lots of write transactions) will be sent over your network. This causes load and latency on both nodes.

Pros

Use commodity hardware, no need for special shared storage like SAN or iSCSI

Do not rely on snapshot-based synchronisation (like rsync for example)

Do not rely on NFS or other file servers which could impose a single point of failure on their own

Cons

Still experimental, not tested under heavy load, possibly unstable

No support; if it breaks you’re on your own

Implementation not as mature as DRBD

Still a lot of manual work involved

#1 General System Setup

I have already pointed out some recommendations about the system setup previously. If you stick with these you may save yourself some trouble.

When you install FreeBSD make sure you take a current 6.x series release. The 5.x series might work too, though it happened to be a bit flaky at my site due to locking issues. YMMV.

There are no special considerations except for the partition layout: reserve a partition which will contain the data to be replicated to the remote host. Don’t make it too big, as the whole thing has to be synchronized over the network.

Choose the size according to your actual disk space requirements, the network speed and latency, and also the IO performance of your system. A 500 GB partition may be too big, even when running over Gigabit ethernet. A size anywhere from 100 megs to 20 gigs should be fine though.

Since you will hopefully have two identical nodes, make the partition tables/disk slices match each other. This helps greatly to reduce issues caused by different device names.

You should also refrain from using any geom/ataraid software RAID on the disks/slices to be exported. Remember that you will already be doing a software RAID1 over the network. Placing another software RAID onto the underlying device will lead to deadlocks in most cases. Also, your system will have twice the load, as the data actually has to be written out four times.

If you really want the additional safety of local disk RAID, do yourself a favor and use a real hardware RAID controller instead. This will even help you get good IO performance. Of course fast disks are a must then.

Now make sure both nodes support the GEOM mirroring module. Enable it by adding the following line to your /boot/loader.conf:

geom_mirror_load="YES"

Do the same for the GEOM gate module:

geom_gate_load="YES"

If your secure level allows loading kernel modules at runtime, you may omit these steps. Check it like this:

#sysctl kern.securelevel

Any return value other than 0 or -1 denotes that kernel modules may not be loaded at runtime. In this case a reboot is required to load the modules. But check out step #3 first.

#3 Configure Network Interfaces

Make sure your network interfaces are configured properly.

Since I have two of them, I use one as the public interface and the other as the private one.

The latter uses private IP addresses according to RFC1918 and is connected to the remote host using a crossover cable.

On both hosts fxp0 is the public interface (which later uses the address 172.16.100.1 on the master node and 172.16.100.2 on the failover node).

On the master node the additional public IP address 172.16.100.12 is bound as an alias and used to provide the public services. It will be monitored by freevrrpd and will move over to the failover node when required.

fxp1 is the private interface used for data replication (192.168.100.1 for the master node and 192.168.100.2 for the failover node).

Now export the slices which shall be used for replication (/dev/da0s1d in my case). You do this by creating a file called /etc/gg.exports on the master server:

192.168.100.2 RW /dev/da0s1d

And the same on the standby server:

192.168.100.1 RW /dev/da0s1d

You’ll find more on this in the ggated man page. Basically you’re just exporting the underlying device to the given IP address in read/write mode.

Now, since ggated does not support any password protection or encryption at all, it is best to use a dedicated network for this anyway. This will also lower the load you place on the public network segment. For optimum performance, Gigabit ethernet is recommended.

When you’re set with the config files, ggated must be started on the failover node (yes: the failover node, not on the master!). You do this by running:

#ggated -v

This will place ggated in verbose mode and run in foreground, which is useful for debugging purposes. Later on, when everything works fine, this can be omitted.

Please note that you should not export the partition on both nodes at the same time. Run ggated only on the host which is the current failover node. Use the freevrrpd master/backup scripts to start/stop the service as required.

#6 Import Disk Slices

Turning to the primary node, the remote disk slices must now be imported.

This is done through ggatec, the client component of ggated. Run it as follows:

#ggatec create 192.168.100.2 /dev/da0s1d

This command will return the device node name; if it is the first one created, it is usually ‘ggate0’.

Consider that you should run ggatec only on the designated primary node. Use the freevrrpd master/backup script facilities to create/delete the ggate device node according to its state.

Do not create the device node on the failover node as long as it is not in primary state. Do not delete the device node as long as the host is in master state (except for recovery purpose, but this will be covered later).
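The start/stop logic those freevrrpd scripts need is small. A rough sketch (freevrrpd simply runs a designated script when a node changes state; everything beyond the commands already shown in this article is an assumption):

```sh
#!/bin/sh
# Hypothetical freevrrpd "master" script for a node entering MASTER state:
# import the peer's exported slice so gmirror can replicate onto it.
ggatec create 192.168.100.2 /dev/da0s1d

# A matching "backup" script for a node dropping to BACKUP state would
# instead destroy the ggate device and export the local slice:
#   ggatec destroy -u 0
#   ggated
```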

#7 Setup Replication

Now it’s actually time to bring up replication. This is where the gmirror kernel module we enabled previously comes in handy.

Make sure you’re on the primary node, then initialize a new GEOM mirror:

#gmirror label -v -n -b prefer gm0 /dev/ggate0

Then insert the local disk slice:

#gmirror insert -p 100 gm0 /dev/ad0s1e

Rebuild the mirror:

#gmirror rebuild gm0 ggate0

If you want to use the geom mirror auto synchronisation features, you can enable these as follows:

#gmirror configure -a gm0

This will cause the disk slices to be synchronized; the data from the local ad0s1e will actually be copied over to the remote ggate0 device.

This will surely take some time, depending on the size of your partition and the speed of your network. When finished, a message like this will appear in the dmesg log of your primary node:

You may have noticed the “prefer” balance algorithm. This setting actually means that read requests shall only be directed to the geom provider with the highest priority.

By adding /dev/ad0s1e (which is always the local disk) with a priority of 100 (actually any priority higher than that of ggate0 according to “gmirror list gm0” output is fine) you force all read requests to be directed to this device only.

You could actually use the “round-robin” balance algorithm as well; however, this requires a fast network connection with low latency, otherwise your read performance will drop significantly.

You may now “newfs” your gm0 device, mount and use it as you would with any other data partition.

First of all, you should now test the setup. Monitor the system performance on both hosts using “vmstat” or a similar tool. Keep an eye on network interface and IO statistics.

If you experience lags, timeouts or sluggish behaviour during usual actions like copying files and directories, the recommendations above will certainly help you. In most cases it’s related to network bandwidth or limits in disk IO.

#8 Failing-Over To The Standby Node

Now that your replication is up and running it’s time to test a failover scenario. We do it by hand so you can see what you actually need to put in freevrrpd master/backup scripts for this purpose.

So go and unplug your current master node (yes, really do it. If you don’t do it now you’ll never do it and it is likely to never work properly).

So you unplugged it? Fine, that’s what we want. Now connect to your failover node and stop the ggated service.

This should cause geom mirror to pick up the gm0 device with provider /dev/da0s1e automatically.

Now you must run fsck to ensure filesystem integrity (you really must do this as the filesystem will always be dirty):

#fsck -t ufs /dev/mirror/gm0

Then you can mount the device:

#mount /dev/mirror/gm0 /mnt

Step #9 will explain how the mirror may be rebuilt if the previous master node becomes available again.

#9 Recovering

To bring the master host back into the active setup you will need to make sure that the gm0 device is actually shut down on the failed host.

You remember that we enabled permanent loading of the geom mirror module previously? This is required to circumvent some problematic situations when kernel secure level is in effect. But it also means that geom mirror will automatically pick up the gm0 device. This will, however, prevent you from exporting the underlying device through geom gate, so gm0 must be disabled first. You can do it like this:

#gmirror stop gm0

As soon as it is stopped you may run ggated to export the partition (we’re doing it in debug mode):

#ggated -v

If you get an error stating failure to open the /dev/da0s1e device, it may still be locked by the geom mirror class. Just look at the “gmirror list” output and stop the device as required.

Once ggated is running, turn to your failover host and turn off auto-configuration on the geom mirror:

#gmirror configure -n gm0

Then make the ggate device available to your node:

#ggatec create 192.168.100.2 /dev/da0s1e

Reinsert the ggate device into the geom mirror using a low priority of ‘0’:

#gmirror insert -p 0 gm0 /dev/ggate0

and re-enable auto-configuration on the mirror:

#gmirror configure -a gm0

I’d recommend always rebuilding the mirror unless you’re absolutely sure that no new data has been added to the gm0 device in the meantime.

#gmirror rebuild gm0 ggate0

Make sure you give the ggate0 device as the last argument, which makes it the “sync target”. If you happen to run “gmirror rebuild gm0 da0s1e” accidentally, this will sync the other way round, most likely leaving you with corrupt or lost data.

The rebuild will take some time depending on the partition size and network speed. When it finishes you will see a message like this in your kernel log:

Now you will have to remove the local /dev/ad0s1d device from the mirror and reinsert it using a high priority:

#gmirror remove gm0 /dev/ad0s1d

#gmirror insert -p 100 gm0 /dev/ad0s1d

The geom mirror will automatically rebuild the provider if required.

This is actually required to fix the read priority I talked about previously, though only if you want the previous failover node to become your new master node.

If you do not intend to switch the designated roles and want to make your failed primary the active node again, have a look at the next sections.

#10 What If The Failover Node Fails?

Imagine you need to reboot your failover node, say to install some updates. Or even worse: it has rebooted due to a kernel panic, power loss or another real-life situation. In any case you should put the geom mirror on the master host into degraded mode by forcibly disconnecting the ggate0 device from the mirror:
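In gmirror terms the disconnect is along these lines (a sketch; which subcommand applies depends on whether the dead component still shows up in “gmirror list” output):

```sh
# Mark the remote component inactive but keep its metadata:
gmirror deactivate gm0 ggate0

# Or, if the component has vanished entirely, drop it from the mirror:
#   gmirror forget gm0
```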

The gm0 is now running in degraded state until you re-insert your fail-over node to the configuration.

There is no problem in doing it this way anyway, as you have to do a full resync in either case after the failover node is up again.

The reason to remove the ggate0 device is to prevent IO locking on the geom mirror device.

#11 How To Recover Replication

To bring the fail-over host back into the active setup you will need to make sure that the gm0 device is actually shut down on the failed host.

#gmirror stop gm0

As soon as it is stopped you may run ggated to export the partition (we’re doing it in debug mode):

#ggated -v

If you get an error stating failure to open the /dev/da0s1e device, it may still be locked by the geom mirror class. Just look at the “gmirror list” output and stop the device as required.

Once ggated is running, make the remote disk slice available on the other host:

#ggatec create 192.168.100.2 /dev/da0s1e

This will have the ggate0 device created and added automatically to your gm0 device.

I’d recommend always rebuilding the mirror unless you’re absolutely sure that no new data has been added to the gm0 device in the meantime.

#gmirror rebuild gm0 ggate0

Make sure you give the ggate0 device as the last argument, which makes it the “sync target”. If you happen to run “gmirror rebuild gm0 da0s1e” accidentally, this will sync the other way round, most likely leaving you with corrupt or lost data.

The rebuild will take some time depending on the partition size and network speed. When it finishes you will see a message like this in your kernel log:

don’t try any fancy primary-primary replication stuff, it is not possible

never (as in never) mount the filesystem (the underlying partition, to be exact) on the failover node

to access the data, mount the geom mirror device; hence it’s only possible on the master node. Don’t ever do it on the failover node unless you have taken proper recovery action as described above

always run fsck on the geom mirror after failover

it’s better not to mount the geom mirror through fstab automatically. Use some freevrrpd recovery magic instead

Always take backups. This solution is to allow realtime replication for HA services. It is no substitute for proper backups at any time.

#13 Security Considerations

As you may have noticed, ggated doesn’t support any security or encryption mechanism by default. “Security” is implemented only as IP-based access restrictions combined with read/write or read-only flags.

To enhance security a bit you should always use a dedicated network interface for data replication, preferably a private one which is not connected to the internet. Crossover host-to-host cabling is fine.

If you need to go over the (insecure) public network, please use additional firewall rules to restrict port access to authorized hosts.

Both ggated and ggatec also allow using a port different from their default, so it would be possible to set up a redirect through stunnel. This may however impose another performance hit on your hosts, especially if your network connection is laggy or slow.

#14 Observations

It may look a bit complicated at first glance, but it is basically nothing more than spanning a software RAID1 across networked hosts.

In theory it’s possible to apply any RAID configuration supported by geom across networked hosts, but there is no practical reason for doing so.

The possibilities offered by this setup are huge if it is implemented properly. You can easily add HA to services which do not support it on their own.

If you happen to implement a live environment upon this technology some time, just let me know how it worked out.

This entry was posted on Friday, August 11th, 2006 at 10:22 am and is filed under HA. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

7 Responses to “Realtime File System Replication On FreeBSD”

I have this in production for our CVS server, which is also serving Mantis and MediaWiki. Everything works great until someone checks in a 40MB file. That seems to clog the dedicated gigabit ethernet connection and cause some sort of deadlock. At that point I must break the mirror (gmirror deactivate data ggate0) and all works just fine. Then I have to rebuild, which takes about 3-4 hours for the 300G partition that is being replicated.

I have increased the ggate buffers to 8192 (ggatec create -q 8192 10.10.10.x /dev/ar1), but that doesn’t fix the problem.

I have tried this on my servers; however, the replication performance was very poor. It took more than 4 hours to rebuild a 160GB partition over a gigabit connection, and I sometimes had timeouts when someone tried to write big files. After I tweaked my sysctls to

net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=262144
kern.ipc.maxsockbuf=1048576

it now takes only half an hour to rebuild the 160GB partition, with no more timeouts. Further information can be found at http://www.geekfarm.org/wu/muse/GeomGate.html

There's clustering and clustering. Neither of the two applications
the OP mentioned needs anything like as tight a coupling as what many
commercial 'cluster' solutions provide, or that compute-cluster solutions
like Beowulf or Grid Engine[!] provide.

WWW clustering requires two things:

* A means to detect failed / out of service machines and redirect traffic to alternative servers

* A means to delocalize user sessions between servers

The first requirement can be handled with programs already mentioned such as wackamole/spread or hacluster -- or another alternative is hoststated(8)[*] on OpenBSD. You can use mod_proxy_balancer[+] on recent Apache 2.2.x to good effect. Certain web technologies provide this sort of capability directly: eg. the mod_jk or the newer mod_proxy_ajp13 modules for apache can balance traffic across a number of back-end tomcat workers; of course this only applies to sites written in Java.

If you're dealing with high traffic levels and have plenty of money to spend, then a hardware load balancer (Cisco Arrowpoint, Alteon Acedirector, Foundry ServerIron etc.) is a pretty standard choice.

The second requirement is more subtle. Any reasonably complicated web application nowadays is unlikely to be completely stateless. Either you have to recognise each session and direct the traffic back to the same server each time, or you have to store the session state in a way that is accessible to all servers -- typically in a back-end database. Implementing 'sticky sessions' is generally slightly easier in terms of application programming, but less resilient to machine failure. There are other alternatives: Java Servlet based applications running under Apache Tomcat can cluster about 4 machines together so that session state is replicated to all of them. This solution is however not at all scalable beyond 4 machines, as they'll quickly spend more time passing state information between themselves than they do actually serving incoming web queries.

Mail clustering is an entirely different beast. In fact, it's two
different beasts with entirely different characteristics.

The easy part with mail is the MTA -- SMTP has built in intrinsic concepts of fail-over and retrying with alternate servers. Just set up appropriate MX records in the DNS pointing at a selection of servers and it all should work pretty much straight away. You may need to share certain data between your SMTP servers (like greylisting status, Bayesian spam filtering, authentication databases) but the software is generally written with this capability built in.
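For illustration, the DNS side of that is just two MX records with different preferences (zone-file syntax, hostnames made up); sending MTAs try the lowest preference first and fall back to the next one automatically:

```
example.com.  IN MX 10 mx1.example.com.
example.com.  IN MX 20 mx2.example.com.
```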

The hard part with mail clustering is the mail store which provides the IMAP or POP3 or WebMail interface to allow users to actually read their mail. To my knowledge there is no freely available opensource solution that provides an entirely resilient IMAP/POP3 service. Cyrus Murder comes close, in that it provides multiple back-end mail stores, easy migration of mailboxes between stores and resilient front ends. The typical approach here is to use a high-spec server with RAIDed disk systems, multiple PSUs etc. and to keep very good backups.

Cheers,

Matthew
==============================================
High Availability means that your cluster should keep working even if some system components fail.

http://en.wikipedia.org/wiki/High-availability_cluster

For building an HA cluster you should have at least two machines: the first runs in master mode, the second in slave (standby) mode.

At any given time only one machine works and provides the services (www, db, etc.).

A very good idea is to use a NAS/SAN, i.e. network-attached storage ( http://en.wikipedia.org/wiki/Network-attached_storage ), with a shared disk. Both nodes of the HA cluster use this shared disk (but only one at any given time). If one node fails, the second (standby) node becomes the master of the cluster and starts the services that the cluster provides.

But NAS systems are not cheap!

Another way is to use software systems such as DRBD, NFS, chironfs, rsync etc. Most of these high-availability software solutions work by replicating a disk partition in master/slave mode.

Heartbeat + DRBD is one of the most popular redundancy solutions.

DRBD mirrors a partition between two machines, allowing only one of them to mount it at a time. Heartbeat then monitors the machines and, if it detects that one of them has died, takes control by mounting the mirrored disk and starting all the services the other machine was running. Unfortunately DRBD runs only on Linux, but I recommend you look at how it works to understand this technology.

I have been running freevrrpd and pen (http://siag.nu/pen/ or in ports) for HA web services.

My setup was a firewall/gateway consisting of more than one machine using freevrrpd, thus enabling failover for the firewall/gateway. I write firewall and not firewalls, since freevrrpd creates a virtual IP that is failed over between the machines.

On the firewall/gateway pen was running, pointed towards the web servers. Pen can point at as many web servers as you like and balances the load between them in a very simple way. If the web servers are identical in setup, they become redundant. DNS load balancing is very similar.
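For illustration, a minimal pen invocation along those lines (hostnames made up; pen takes a listen port followed by a list of back-end servers):

```sh
# Balance incoming port-80 connections across two identical web servers.
pen 80 web1.example.com:80 web2.example.com:80
```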

Good luck!

/Roger

====================================================
CARP does the job perfectly!

If you have to load-balance / reverse-proxy from a front end (the SPOF?), you can also take a quick look at Lighttpd with the proxy module (very simple & efficient)