We've been targeting malware for over a year and a half, and these efforts are paying off. We are now able to display warnings in search results when a site is known to be malicious, which can help you avoid drive-by downloads and other computer compromises. We are already distributing this data through the Safe Browsing API, and we are working on bringing this protection to more users by integrating with more Google products. While these are great steps, we need your help going forward!

Currently, we know of hundreds of thousands of websites that attempt to infect people's computers with malware. Unfortunately, we also know that there are more malware sites out there. This is where we need your help in filling in the gaps. If you come across a site that is hosting malware, we now have an easy way for you to let us know about it. If you come across a site that is hosting malware, please fill out this short form. Help us keep the internet safe, and report sites that distribute malware.

Google encourages its employees to contribute back to the open source community, and there is no exception in Google's Security Team. Let's look at some interesting open source vulnerabilities that were located and fixed by members of Google's Security team. It is interesting to classify and aggregate the code flaws leading to the vulnerabilities, to see if any particular type of flaw is more prevalent.

JDK. In May 2007, I released details on an interesting bug in the ICC profile parser in Sun's JDK. The bug is particularly interesting because it could be exploited by an evil image. Most previous JDK bugs involve a user having to run a whole evil applet. The key parts of code which demonstrate the bug are as follows:

Both TagSize and TagOffset are untrusted unsigned 32-bit values pulled out of images being parsed. They are added together, causing a classic integer overflow condition and the bypass of the size check. A subsequent additional integer overflow in the allocation of a buffer leads to a heap-based buffer overflow.

gunzip. In September 2006, my colleague Tavis Ormandy reported some interesting vulnerabilities in the gunzip decompressor. They were triggered when an evil compressed archive is decompressed. A lot of programs will automatically pass compressed data through gunzip, making it an interesting attack. The key parts of the code which demonstrate one of the bugs are as follows:

Here, the stack-based array "count" is indexed by values in the "bitlen" array. These values are under the control of data in the incoming untrusted compressed data, and were not checked for being within the bounds of the "count" array. This led to corruption of data on the stack.

libtiff. In August 2006, Tavis reported a range of security vulnerabilities in the libtiff image parsing library. A lot of image manipulation programs and services will be using libtiff if they handle TIFF format files. So, an evil TIFF file could compromise a lot of desktops or even servers. The key parts of the code which demonstrate one of the bugs are as follows:

Here, a TIFF file containing a JPEG image is being processed. In this case, both the TIFF header and the embedded JPEG image contain their own copies of the width and height of the image in pixels. This check above notices when these values differ, issues a warning, and continues. The destination buffer for the pixels is allocated based on the TIFF header values, and it is filled based on the JPEG values. This leads to a buffer overflow if a malicious image file contains a JPEG with larger dimensions than those in the TIFF header. Presumably the intent here was to support broken files where the embedded JPEG had smaller dimensions than those in the TIFF header. However, the consequences of larger dimensions that those in the TIFF header had not been considered.

We can draw some interesting conclusions from these bugs. The specific vulnerabilities are integer overflows, out-of-bounds array accesses and buffer overflows. However, the general theme is using an integer from an untrusted source without adequately sanity checking it. Integer abuse issues are still very common in code, particular code which is decoding untrusted binary data or protocols. We recommend being careful using any such code until it has been vetted for security (by extensive code auditing, fuzz testing, or preferably both). It is also important to watch for security updates for any decoding software you use, and keep patching up to date.

Security testing of applications is regularly performed using fuzz testing. As previously discussed on this blog, Srinath's Lemon uses a form of smart fuzzing. Lemon is aware of classes of web application threats and the input families which trigger them, but not all fuzz testing frameworks have to be this complicated. Fuzz testing originally relied on purely random data, ignorant of specific threats and known dangerous input. Today, this approach is often overlooked in favor of more complicated techniques. Early sanity checks in applications looking for something as a simple as a version number may render testing with completely random input ineffective. However, the newer, more complicated fuzz testers require a considerable initial investment in the form of complete input format specifications or the selection of a large corpus of initial input samples.

At WOOT'07,I presented a paper on Flayer, a tool we developed internally to augment our security testing efforts. In particular, it allows for a fuzz testing technique that compromises between the original idea and the most complicated. Flayer makes it possible to remove input sanity checks at execution time. With the small investment of identifying these checks, Flayer allows for completely random testing to be performed with much higher efficacy. Already, we've uncovered multiple vulnerabilities in Internet-critical software using this approach.

The way that Flayer allows for sanity checks to be identified is perhaps the more interesting point. Flayer uses a dynamic analysis framework to analyze the target application at execution time. Flayer marks, or taints, input to the program and traces that data throughout its lifespan. Considerable research has been done in the past regarding information flow tracing using dynamic analysis. Primarily, this work has been aimed at malware and exploit detection and defense. However, none of the resulting software has been made publicly available.

While Flayer is still in its early stages, it is available for download under the GNU Public License. External contributions and feedbackare encouraged!

Cross-site scripting (aka XSS) is the term used to describe a class of security vulnerabilities in web applications. An attacker can inject malicious scripts to perform unauthorized actions in the context of the victim's web session. Any web application that serves documents that include data from untrusted sources could be vulnerable to XSS if the untrusted data is not appropriately sanitized. A web application that is vulnerable to XSS can be exploited in two major ways:

Stored XSS - Commonly exploited in a web application where one user enters information that's viewed by another user. An attacker can inject malicious scripts that are executed in the context of the victim's session. The exploit is triggered when a victim visits the website at some point in the future, such as through improperly sanitized blog comments and guestbook entries, which facilitates stored XSS.

Reflected XSS - An application that echoes improperly sanitized user input received as query parameters is vulnerable to reflected XSS. With a vulnerable application, an attacker can craft a malicious URL and send it to the victim via email or any other mode of communication. When the victim visits the tampered link, the page is loaded along with the injected script that is executed in the context of the victim's session.

The general principle behind preventing XSS is the proper sanitization (via, for instance, escaping or filtering) of all untrusted data that is output by a web application. If untrusted data is output within an HTML document, the appropriate sanitization depends on the specific context in which the data is inserted into the HTML document. The context could be in the regular HTML body, tag attributes, URL attributes, URL query string attributes, style attributes, inside JavaScript, HTTP response headers, etc.

The following are some (by no means complete) examples of XSS vulnerabilities. Let's assume there is a web application that accepts user input as the 'q' parameter. Untrusted data coming from the attacker is marked in red.

In the cases where XSS arises from meta characters being inserted from untrusted sources into an HTML document, the issue can be avoided either by filtering/disallowing the meta characters, or by escaping them appropriately for the given HTML context. For example, the HTML meta characters <, >, &, " and ' must be replaced with their corresponding HTML entity references &lt;, &gt;, &amp;, &quot; and &#39 respectively. In a JavaScript-literal context, inserting a backslash in front of \, ', " and converting the carriage returns, line-feeds and tabs into \r, \n and \t respectively should avoid untrusted meta characters being interpreted as code.

How about an automated tool for finding XSS problems in web applications? Our security team has been developing a black box fuzzing tool called Lemon (deriving from the commonly-recognized name for a defective product). Fuzz testing (also referred to as fault-injection testing) is an automated testing approach based on supplying inputs that are designed to trigger and expose flaws in the application. Our vulnerability testing tool enumerates a web application's URLs and corresponding input parameters. It then iteratively supplies fault strings designed to expose XSS and other vulnerabilities to each input, and analyzes the resulting responses for evidence of such vulnerabilities. Although it started out as an experimental tool, it has proved to be quite effective in finding XSS problems. Besides XSS, it finds other security problems such as response splitting attacks, cookie poisoning problems, stacktrace leaks, encoding issues and charset bugs. Since the tool is homegrown it is easy to integrate into our automated test environment and to extend based on specific needs. We are constantly in the process of adding new attack vectors to improve the tool against known security problems.

Update:I wanted to respond to a few questions that seem to be common among readers. I've listed them below. Thanks for the feedback. Please keep the questions and comments coming.

Q. Does Google plan to market it at some point?A. Lemon is highly customized for Google apps and we have no plans of releasing it in near future.

Q. Did Google's security team check out any commercially available fuzzers? Is the ability to keep improving the fuzzer the main draw of a homegrown tool?A. We did evaluate commercially available fuzzers but felt that our specialized needs could be served best by developing our own tools.

Some of you might have seen this message while searching on Google, and wondered what the reason behind it might be. Instead of search results, Google displays the "We're sorry" message when we detect anomalous queries from your network. As a regular user, it is possible to answer a CAPTCHA - a reverse Turing test meant to establish that we are talking to a human user - and to continue searching. However, automated processes such as worms would have a much harder time solving the CAPTCHA. Several things can trigger the sorry message. Often it's due to infected computers or DSL routers that proxy search traffic through your network - this may be at home or even at a workplace where one or more computers might be infected. Overly aggressive SEO ranking tools may trigger this message, too. In other cases, we have seen self-propagating worms that use Google search to identify vulnerable web servers on the Internet and then exploit them. The exploited systems in turn then search Google for more vulnerable web servers and so on. This can lead to a noticeable increase in search queries and sorry is one of our mechanisms to deal with this.

At ACM WORM 2006, we published a paper on Search Worms [PDF] that takes a much closer look at this phenomenon. Santy, one of the search worms we analyzed, looks for remote-execution vulnerabilities in the popular phpBB2 web application. In addition to exhibiting worm like propagation patterns, Santy also installs a botnet client as a payload that connects the compromised web server to an IRC channel. Adversaries can then remotely control the compromised web servers and use them for DDoS attacks, spam or phishing. Over time, the adversaries have realized that even though a botnet consisting of web servers provides a lot of aggregate bandwidth, they can increase leverage by changing the content on the compromised web servers to infect visitors and in turn join the computers of compromised visitors into much larger botnets. This fundamental change from remote attack to client based download of malware formed the basis of the research presented in our first post. In retrospect, it is interesting to see how two seemingly unrelated problems are tightly connected.

OK, so it might be a little early to declare victory, but we're excited about the Safe Browsing API we launched today. It provides a simple mechanism for downloading Google's lists of suspected phishing and malware URLs, so now any developer can access the blacklists used in products such as Firefox and Google Desktop.

The API is still experimental, but we hope it will be useful to ISPs, web-hosting companies, and anyone building a site or an application that publishes or transmits user-generated links. Sign up for a key and let us know how we can make the API better. We fully expect to iterate on the design and improve the data behind the API, and we'll be paying close attention to your feedback as we do that. We look forward to hearing your thoughts.

In addition to targeting malware, we're interested in combating phishing, a social engineering attack where criminals attempt to lure unsuspecting web surfers into logging into a fake website that looks like a real website, such as eBay, E-gold or an online bank. Following a successful attack, phishers can steal money out of the victims' accounts or take their identities. To protect our users against phishing, we publish a blacklist of known phishing sites. This blacklist is the basis for the anti-phishing features in the latest versions of Firefox and Google Desktop. Although blacklists are necessarily a step behind as phishers move their phishing pages around, blacklists have proved to be reasonably effective.

Not all phishing attacks target sites with obvious financial value. Beginning in mid-March, we detected a five-fold increase in overall phishing page views. It turned out that the phishing pages generating 95% of the new phishing traffic targeted MySpace, the popular social networking site. While a MySpace account does not have any intrinsic monetary value, phishers had come up with ways to monetize this attack. We observed hijacked accounts being used to spread bulletin board spam for some advertising revenue. According to this interview with a phisher, phishers also logged in to the email accounts of the profile owners to harvest financial account information. In any case, phishing MySpace became profitable enough (more than phishing more traditional targets) that many of the active phishers began targeting it.

Interestingly, the attack vector for this new attack appeared to be MySpace itself, rather than the usual email spam. To observe the phishers' actions, we fed them the login information for a dummy MySpace account. We saw that when phishers compromised a MySpace account, they added links to their phishing page on the stolen profile, which would in turn result in additional users getting compromised. Using a quirk of the CSS supported in MySpace profiles, the phishers injected these links invisibly as see-through images covering compromised profiles. Clicking anywhere on an infected profile, including on links that appeared normal, redirected the user to a phishing page. Here's a sample of some CSS code injected into the "About Me" section of an affected profile:

In addition to contributing to the viral growth of the phishing attack, linking directly off of real MySpace content added to the appearance of legitimacy of these phishing pages. In fact, we received thousands of complaints from confused users along the lines of "Why won't it let any of my friends look at my pictures?" regarding our warnings on these phishing pages, suggesting that even an explicit warning was not enough to protect many users. The effectiveness of the attack and the increasing sophistication of the phishing pages, some of which were hosted on botnets and were near perfect duplications of MySpace's login page, meant that we needed to switch tactics to combat this new threat.

In late March, we reached out to MySpace to see what we could do to help. We provided lists of the top phishing sites and our anti-phishing blacklist to MySpace so that they could disable compromised accounts with links to those sites. Unfortunately, many of the blocked users did not remove the phishing links when they reactivated their accounts, so the attacks continued to spread. On April 19, MySpace updated their server software so that they could disable bad links in users' profiles without requiring any user action or altering any other profile content. Overnight, overall phishing traffic dropped by a factor of five back to the levels observed in early March. While MySpace phishing continues at much lower volumes, phishers are beginning to move on to new targets.

Update your software regularly and run an anti-virus program. If a cyber-criminal gains control of your computer through a virus or a software security flaw, he doesn't need to resort to phishing to steal your information.

Use different passwords on different sites and change them periodically. Phishers routinely try to log in to high-value targets, like online banking sites, with the passwords they steal for lower-value sites, like webmail and social networking services.

In this post, we investigate the distribution of web server software to provide insight into how server software is correlated to servers hosting malware binaries or engaging in drive-by-downloads.

We determine server operating system by examining the 'Server:' HTTP header reported by most web servers. A survey of servers running roughly 80 million domain names reveals the web server software distribution shown below. Note that these figures may have some margin of error as it is not unusual to find hundreds of domains served by a single IP address.

Web server software across the Internet.

Web server software distribution across the Internet.

Our numbers report a slightly larger fraction of Apache servers compared to the Netcraft web server survey. Our analysis is based on crawl information and only root URLs were examined, therefore hosts that did not present a root URL (e.g. /index.htm) were not included in the statistics. This may have contributed to the disparity with the Netcraft numbers.

Amongst Apache servers, about 35% did not report any version information. Presumably the lack of version information is considered to be a defense against version specific attacks and worms. We observed a long tail of Apache server versions; the top three detected were 1.3.37 (15%), 1.3.33 (7.91%), and 2.0.54 (6.25%).

Amongst Microsoft servers, IIS 6.0 is by far the most popular version, making up about 80% of all IIS servers. IIS 5.0 made up most of the remainder.

Web server software across servers distributing malware.

We examined about 70,000 domains that over the past month have been either distributing malware or have been responsible for hosting browser exploits leading to drive-by-downloads. The breakdown by server software is depicted below. It is important to note that while many servers serve malware as a result of a server compromise (by remote exploits, password theft via keyloggers, etc.), some servers are configured to serve up exploits by their administrators.

Web server software distribution across malicious servers.

Compared to our sample of servers across the Internet, Microsoft IIS features twice as often (49% vs. 23%) as a malware distributing server. Amongst Microsoft IIS servers, the share of IIS 6.0 and IIS 5.0 remained the same at 80% and 20% respectively.

The distribution of top featured Apache server versions was different this time: 1.3.37 (50%), 1.3.34 (12%) and 1.3.33 (5%). 21% of the Apache servers did not report any version information. Incidentally, version 1.3.37 is the latest Apache server release in the 1.3 series, and it is hence somewhat of a surprise that this version features so prominently. One other factor we observe is a vast collection of Apache modules in use.

Distribution of web server software by country.

Web server distribution by country

Malicious web server distribution by country

The figure on the left shows the distribution of all Apache, IIS, and nginx webservers by country. Apache has the largest share, even though there is noticeable variation between countries. The figure on the right shows the distribution, by country, of webserver software of servers either distributing malware or hosting browser exploits. It is very interesting to see that in China and South Korea, a malicious server is much more likely to be running IIS than Apache.

We suspect that the causes for IIS featuring more prominently in these countries could be due to a combination of factors: first, automatic updates have not been enabled due to software piracy (piracy statistics from NationMaster, and BSA), and second, some security patches are not available for pirated copies of Microsoft operating systems. For instance the patch for a commonly seen ADODB.Stream exploit is not available to pirated copies of Windows operating systems.

Overall, we see a mix of results. In Germany, for instance, Apache is more likely to be serving malware than Microsoft IIS, compared to the overall distributions of these servers. In Asia, we see the reverse, which is part of the cause of Microsoft IIS having a disproportionately high representation at 49% of malware servers. In summary, our analysis demonstrates how important it is to keep web servers patched to the latest patch level.

Following Panayiotis' and Niels' post on malware, I'd like to discuss a somewhat related topic, virtualisation. Virtual machines are often used by security researchers to sandbox malware samples for analysis, or to protect a machine from a potentially hazardous activity. The theory is that any security threat or malicious behaviour will be restricted to the virtual environment which can be discarded and then restored to pristine condition after use.

Virtual machines are sometimes thought of as impenetrable barriers between the guest and host, but in reality they're (usually) just another layer of software between you and the attacker. As with any complex application, it would be naive to think such a large codebase could be written without some serious bugs creeping in. If any of those bugs are exploitable, attackers restricted to the guest could potentially break out onto the host machine. I investigated this topic earlier this year, and presented a paper at CanSecWest on a number of ways that an attacker could break out of a virtual machine.

Most of the attacks identified were flaws, such as buffer overflows, in emulated hardware devices. One example of this is missing bounds checking in bitblt routines, which are used for moving rectangular blocks of data around the display. If exploited, by specifying pathological parameters for the operation, this could lead to an attacker compromising the virtual machine process. While you would typically require root (or equivalent) privileges in the guest to interact with a device at the low level required, device drivers will often offload the parameter checking required onto the hardware, so in theory an unprivileged attacker could be able to access flaws like this by simply interacting with the regular API or system call interface provided by the guest operating system.

While researching this topic we worked with the vendors affected to make sure they were aware of our findings, and provided patches where possible. I've also suggested some precautions virtualization you can take to minimise the impact of any flaws like this discovered in future, such as:

Reduce the attack surface

By disabling emulated devices, features and services you don't need you reduce the amount of code exposed to an attacker, thus reducing the number of possible bugs that can be exploited. You should also aim to protect the integrity of the guest operating system, making it harder for an attacker to get lower level access to emulated hardware. By keeping software in the guest up to date, and hardening it by locking down the operating system and minimising what is run with root or admin privileges, you can reduce the risk of privilege escalation attacks. If an attacker cannot get low level access to the emulated hardware, it will be more difficult to exploit the bugs in them. Remember that some legacy operating systems make no attempt to restrict access to I/O ports and similar interfaces, these should be used with caution in a security sensitive context.

Treat virtual machines as services that can be compromised

Most administrators will take steps to limit the impact of a compromise of a network facing daemon, such as using chroot() or running the daemon as a low privileged user. These same tactics can be applied to your virtual machine. As always, try to minimise what has to run as root or administrator.

Keep software up to date

Keep your virtual machine software up to date, and look out for any security advisories from your vendor so that you can apply any patches promptly.

Online security is an important topic for Google, our users, and anyone who uses the Internet. The related issues are complex and dynamic and we've been looking for a way to foster discussion on the topic and keep users informed. Thus, we've started this blog where we hope to periodically provide updates on recent trends, interesting findings, and efforts related to online security. Among the issues we'll tackle is malware, which is the subject of our inaugural post.

Malware -- surreptitious software capable of stealing sensitive information from your computer -- is increasingly spreading over the web. Visiting a compromised web server with a vulnerable browser or plugins can result in your system being infected with a whole variety of malware without any interaction on your part. Software installations that leverage exploits are termed "drive-by downloads". To protect Google's users from this threat, we started an anti-malware effort about a year ago. As a result, we can warn you in our search results if we know of a site to be harmful and even prevent exploits from loading with Google Desktop Search.

Unfortunately, the scope of the problem has recently been somewhat misreported to suggest that one in 10 websites are potentially malicious. To clarify, a sample-based analysis puts the fraction of malicious pages at roughly 0.1%. The analysis described in our paper covers billions of URLs. Using targeted feature extraction and classification, we select a subset of URLs believed to be suspicious for in-depth investigation. So far, we have investigated about 12 million suspicious URLs and found about 1 million that engage in drive-by downloads. In most cases, the web sites that infect your system with malware are not intentionally doing so and are often unaware that their web servers have been compromised.

To get a better understanding about the geographic distribution of sites engaging in drive-by downloads, we analyzed the location of compromised web sites and the location of malware distribution hosts. At the moment, the majority of malware activity seems to happen in China, the U.S., Germany and Russia (see below):

Location of compromised web sites.These are often sites that are benign in nature but have been compromised and have become dangerous for users to visit.

Location of malware distribution servers.These are servers that are used by malware authors to distribute their payload. Very often the compromised sites are modified to include content from these servers. The color coding works as follows: Green means that we did not find anything unsual in that country, yellow means low activity, orange medium activity and red high activity.

Guidelines on safe browsingFirst and foremost, enable automatic updates for your operating system as well your browsers, browser plugins and other applications you are using. Automatic updates ensure that your computer receives the latest security patches as they are published. We also recommend that you run an anti-virus engine that checks network traffic and files on your computer for known malware and abnormal behavior. If you want to be really sure that your system does not become permanently compromised, you might even want to run your browser in a virtual machine, which you can revert to a clean snapshot after every browsing session.