Unfortunately, it seems to have devolved into mostly a get-rich-quick scheme for nerds, and by nearly any measure it’s turning into a spectacular catastrophe. Its “success” is measured in how much a bitcoin is worth in US dollars, which is pretty close to an admission from its own investors that its only value is in converting back to “real” money — all while that same “success” is making it less useful as a distinct currency.

Blah, blah, everyone already knows this.

What concerns me slightly more is the gold rush hype cycle, which is putting cryptocurrency and “blockchain” in the news and lending it all legitimacy. People have raked in millions of dollars on ICOs of novel coins I’ve never heard mentioned again. (Note: again, that value is measured in dollars.) Most likely, none of the investors will see any return whatsoever on that money. They can’t, really, unless a coin actually takes off as a currency, and that seems at odds with speculative investing since everyone either wants to hoard or ditch their coins. When the coins have no value themselves, the money can only come from other investors, and eventually the hype winds down and you run out of other investors.

I fear this will hurt a lot of people before it’s over, so I’d like for it to be over as soon as possible.

That said, the hype itself has gotten way out of hand too. First it was the obsession with “blockchain” like it’s a revolutionary technology, but hey, Git is a fucking blockchain. The novel part is the way it handles distributed consensus (which in Git is basically left for you to figure out), and that’s uniquely important to currency because you want to be pretty sure that money doesn’t get duplicated or lost when moved around.

But now we have startups trying to use blockchains for website backends and file storage and who knows what else? Why? What advantage does this have? When you say “blockchain”, I hear “single Git repository” — so when you say “email on the blockchain”, I have an aneurysm.

Bitcoin seems to have sparked imagination in large part because it’s decentralized, but I’d argue it’s actually a pretty bad example of a decentralized network, since people keep forking it. The ability to fork is a feature, sure, but the trouble here is that the Bitcoin family has no notion of federation — there is one canonical Bitcoin ledger and it has no notion of communication with any other. That’s what you want for currency, not necessarily other applications. (Bitcoin also incentivizesfrivolous forking by giving the creator an initial pile of coins to keep and sell.)

And federation is much more interesting than decentralization! Federation gives us email and the web. Federation means I can set up my own instance with my own rules and still be able to meaningfully communicate with the rest of the network. Federation has some amount of tolerance for changes to the protocol, so such changes are more flexible and rely more heavily on consensus.

Federation is fantastic, and it feels like a massive tragedy that this rekindled interest in decentralization is mostly focused on peer-to-peer networks, which do little to address our current problems with centralized platforms.

Again, the tech is cool and all, but the marketing hype is getting way out of hand.

Maybe what I really want from 2018 is less marketing?

For one, I’ve seen a huge uptick in uncritically referring to any software that creates or classifies creative work as “AI”. Can we… can we not. It’s not AI. Yes, yes, nerds, I don’t care about the hair-splitting about the nature of intelligence — you know that when we hear “AI” we think of a human-like self-aware intelligence. But we’re applying it to stuff like a weird dog generator. Or to whatever neural network a website threw into production this week.

And this is dangerously misleading — we already had massive tech companies scapegoating The Algorithm™ for the poor behavior of their software, and now we’re talking about those algorithms as though they were self-aware, untouchable, untameable, unknowable entities of pure chaos whose decisions we are arbitrarily bound to. Ancient, powerful gods who exist just outside human comprehension or law.

It’s weird to see this stuff appear in consumer products so quickly, too. It feels quick, anyway. The latest iPhone can unlock via facial recognition, right? I’m sure a lot of effort was put into ensuring that the same person’s face would always be recognized… but how confident are we that other faces won’t be recognized? I admit I don’t follow all this super closely, so I may be imagining a non-problem, but I do know that humans are remarkably bad at checking for negative cases.

Hell, take the recurring problem of major platforms like Twitter and YouTube classifying anything mentioning “bisexual” as pornographic — because the word is also used as a porn genre, and someone threw a list of porn terms into a filter without thinking too hard about it. That’s just a word list, a fairly simple thing that any human can review; but suddenly we’re confident in opaque networks of inferred details?

I don’t know. “Traditional” classification and generation are much more comforting, since they’re a set of fairly abstract rules that can be examined and followed. Machine learning, as I understand it, is less about rules and much more about pattern-matching; it’s built out of the fingerprints of the stuff it’s trained on. Surely that’s just begging for tons of edge cases. They’re practically made of edge cases.

I’m reminded of a point I saw made a few days ago on Twitter, something I’d never thought about but should have. TurnItIn is a service for universities that checks whether students’ papers match any others, in order to detect cheating. But this is a paid service, one that fundamentally hinges on its corpus: a large collection of existing student papers. So students pay money to attend school, where they’re required to let their work be given to a third-party company, which then profits off of it? What kind of a goofy business model is this?

And my thoughts turn to machine learning, which is fundamentally different from an algorithm you can simply copy from a paper, because it’s all about the training data. And to get good results, you need a lot of training data. Where is that all coming from? How many for-profit companies are setting a neural network loose on the web — on millions of people’s work — and then turning around and selling the result as a product?

This is really a question of how intellectual property works in the internet era, and it continues our proud decades-long tradition of just kinda doing whatever we want without thinking about it too much. Nothing if not consistent.

A bit tougher, since computers are pretty alright now and everything continues to chug along. Maybe we should just quit while we’re ahead. There’s some real pie-in-the-sky stuff that would be nice, but it certainly won’t happen within a year, and may never happen except in some horrific Algorithmic™ form designed by people that don’t know anything about the problem space and only works 60% of the time but is treated as though it were bulletproof.

The giants are getting more giant. Maybe too giant? Granted, it could be much worse than Google and Amazon — it could be Apple!

Amazon has its own delivery service and brick-and-mortar stores now, as well as providing the plumbing for vast amounts of the web. They’re not doing anything particularly outrageous, but they kind of loom.

Ad company Google just put ad blocking in its majority-share browser — albeit for the ambiguously-noble goal of only blocking obnoxious ads so that people will be less inclined to install a blanket ad blocker.

Twitter is kind of a nightmare but no one wants to leave. I keep trying to use Mastodon as well, but I always forget about it after a day, whoops.

Facebook sounds like a total nightmare but no one wants to leave that either, because normies don’t use anything else, which is itself direly concerning.

IRC is rapidly bleeding mindshare to Slack and Discord, both of which are far better at the things IRC sadly never tried to do and absolutely terrible at the exact things IRC excels at.

The problem is the same as ever: there’s no incentive to interoperate. There’s no fundamental technical reason why Twitter and Tumblr and MySpace and Facebook can’t intermingle their posts; they just don’t, because why would they bother? It’s extra work that makes it easier for people to not use your ecosystem.

I don’t know what can be done about that, except that hope for a really big player to decide to play nice out of the kindness of their heart. The really big federated success stories — say, the web — mostly won out because they came along first. At this point, how does a federated social network take over? I don’t know.

I… don’t really have a solid grasp on what’s happening in tech socially at the moment. I’ve drifted a bit away from the industry part, which is where that all tends to come up. I have the vague sense that things are improving, but that might just be because the Rust community is the one I hear the most about, and it puts a lot of effort into being inclusive and welcoming.

So… more projects should be like Rust? Do whatever Rust is doing? And not so much what Linus is doing.

I haven’t heard this brought up much lately, but it would still be nice to see. The Bay Area runs on open source and is raking in zillions of dollars on its back; pump some of that cash back into the ecosystem, somehow.

I’ve seen a couple open source projects on Patreon, which is fantastic, but feels like a very small solution given how much money is flowing through the commercial tech industry.

One might wonder where the money to host a website comes from, then? I don’t know. Maybe we should loop this in with the above thing and find a more informal way to pay people for the stuff they make when we find it useful, without the financial and cognitive overhead of A Transaction or Giving Someone My Damn Credit Card Number. You know, something like Bitco— ah, fuck.

More than six years ago in January 2012, file-hosting site Megaupload was shut down by the United States government and founder Kim Dotcom and his associates were arrested in New Zealand.

What followed was an epic legal battle to extradite Dotcom, Mathias Ortmann, Finn Batato, and Bram van der Kolk to the United States to face several counts including copyright infringement, racketeering, and money laundering. Dotcom has battled the US government every inch of the way.

The most significant matters include the validity of the search warrants used to raid Dotcom’s Coatesville home on January 20, 2012. Despite a prolonged trip through the legal system, in 2014 the Supreme Court dismissed Dotcom’s appeals that the search warrants weren’t valid.

In 2015, the District Court later ruled that Dotcom and his associates are eligible for extradition. A subsequent appeal to the High Court failed when in February 2017 – and despite a finding that communicating copyright-protected works to the public is not a criminal offense in New Zealand – a judge also ruled in favor.

Of course, Dotcom and his associates immediately filed appeals and today in the Court of Appeal in Wellington, their hearing got underway.

Lawyer Grant Illingworth, representing Van der Kolk and Ortmann, told the Court that the case had “gone off the rails” during the initial 10-week extradition hearing in 2015, arguing that the case had merited “meaningful” consideration by a judge, something which failed to happen.

“It all went wrong. It went absolutely, totally wrong,” Mr. Illingworth said. “We were not heard.”

As expected, Illingworth underlined the belief that under New Zealand law, a person may only be extradited for an offense that could be tried in a criminal court locally. His clients’ cases do not meet that standard, the lawyer argued.

Turning back the clocks more than six years, Illingworth again raised the thorny issue of the warrants used to authorize the raids on the Megaupload defendants.

It had previously been established that New Zealand’s GCSB intelligence service had illegally spied on Dotcom and his associates in the lead up to their arrests. However, that fact was not disclosed to the District Court judge who authorized the raids.

“We say that there was misleading conduct at this stage because there was no reference to the fact that information had been gathered illegally by the GCSB,” he said.

But according to Justice Forrest Miller, even if this defense argument holds up the High Court had already found there was a prima facie case to answer “with bells on”.

“The difficulty that you face here ultimately is whether the judicial process that has been followed in both of the courts below was meaningful, to use the Canadian standard,” Justice Miller said.

“You’re going to have to persuade us that what Justice Gilbert [in the High Court] ended up with, even assuming your interpretation of the legislation is correct, was wrong.”

Although the US seeks to extradite Dotcom and his associates on 13 charges, including racketeering, copyright infringement, money laundering and wire fraud, the Court of Appeal previously confirmed that extradition could be granted based on just some of the charges.

The stakes couldn’t be much higher. The FBI says that the “Megaupload Conspiracy” earned the quartet $175m and if extradited to the US, they could face decades in jail.

While Dotcom was not in court today, he has been active on Twitter.

“The court process went ‘off the rails’ when the only copyright expert Judge in NZ was >removed< from my case and replaced by a non-tech Judge who asked if Mega was ‘cow storage’. He then simply copy/pasted 85% of the US submissions into his judgment," Dotcom wrote.

Dotcom also appeared to question the suitability of judges at both the High Court and Court of Appeal for the task in hand.

“Justice Miller and Justice Gilbert (he wrote that High Court judgment) were business partners at the law firm Chapman Tripp which represents the Hollywood Studios in my case. Both Judges are now at the Court of Appeal. Gilbert was promoted shortly after ruling against me,” Dotcom added.

Dotcom is currently suing the New Zealand government for billions of dollars in damages over the warrant which triggered his arrest and the demise of Megaupload.

Today is Safer Internet Day, a global awareness campaign to educate the public on all sorts of threats that people face online.

It is a laudable initiative supported by the Industry Trust for IP Awareness which, together with the children’s charity Into Film, has released an informative video and associated course materials.

The organizations have created a British version of an animation previously released as part of the Australian “Price of Piracy” campaign. While the video includes an informative description of the various types of malware, there appears to be a secondary agenda.

Strangely enough, the video itself contains no advice on how to avoid malware at all, other than to avoid pirate sites. In that sense, it looks more like an indirect anti-piracy ad.

While there’s no denying that kids might run into malware if they randomly click on pirate site ads, this problem is certainly not exclusive to these sites. Email and social media are frequently used to link to malware too, and YouTube comments can pose the same risk. The problem is everywhere.

What really caught our eye, however, is the statement that pirate sites are the most used propagation method for malware. “Did you know, the number one way we infect your device is via illegal pirate sites,” an animated piece of malware claims in the video.

Forget about email attachments, spam links, compromised servers, or even network attacks. Pirate sites are the number one spot through which malware spreads. According to the video at least. But where do they get this knowledge?

Meet the malwares

When we asked the Industry Trust for IP Awareness for further details, the organization checked with their Australian colleagues, who pointed us to a working paper (pdf) from 2014. This paper includes the following line: “Illegal streaming websites are now the number one propagation mechanism for malicious software as 97% of them contain malware.”

Unfortunately, there’s a lot wrong with this claim.

Through another citation, the 97% figure points to this unpublished study of which only the highlights were shared. This “malware” research looked at the prevalence of malware and other unwanted software linked to pirate sites. Not just streaming sites as the other paper said, but let’s ignore that last bit.

What the study actually found is that of the 30 researched pirate sites, “90% contained malware or other ‘Potentially Unwanted Programmes’.” Note that this is not the earlier mentioned 97%, and that this broad category not only includes malware but also popup ads, which were most popular. This means that the percentage of actual malware on these sites can be anywhere from 0.1% to 90%.

Importantly, none of the malware found in this research was installed without an action performed by the user, such as clicking on a flashy download button or installing a mysterious .exe file.

Aside from clearly erroneous references, the more worrying issue is that even the original incorrect statement that “97% of all pirate sites contain malware” provides no evidence for the claim in the video that pirate sites are “the number one way” through which malware spreads.

Even if 100% of all pirate sites link to malware, that’s no proof that it’s the most used propagation method.

The malware issue has been a popular talking point for a while, but after searching for answers for days, we couldn’t find a grain of evidence. There are a lot of malware propagation methods, including email, which traditionally is a very popular choice.

Even more confusingly, the same paper that was cited as a source for the pirate site malware claim notes that 80% of all web-based malware is hosted on “innocent” but compromised websites.

As the provided evidence gave no answers, we asked the experts to chime in. Luckily, security company Malwarebytes was willing to share its assessment. As leaders in the anti-malware industry, they should know better than researchers who have their numbers and terminology mixed up.

“Torrent sites are still frequently used by criminals to host malware disguised as something the user wants, like an application, movie, etc. However they are really only a threat to people who use torrent sites regularly and those people have likely learned how to avoid malicious torrents,” he adds.

In other words, most people who regularly visit pirate sites know how to avoid these dangers. That doesn’t mean that they are not a threat to unsuspecting kids who visit them for the first time of course.

“Now, if users who were not familiar with torrent and pirate sites started using these services, there is a high probability that they could encounter some kind of malware. However, many of these sites have user review processes to let other users know if a particular torrent or download is likely malicious.

“So, unless a user is completely new to this process and ignores all the warning signs, they could walk away from a pirate site without getting infected,” Kujawa says.

Overall, the experts at Malwarebytes see no evidence for the claim that pirate sites are the number one propagation method for malware.

“So in summary, I don’t think the claim that ‘pirate sites’ are the number one way to infect users is accurate at all,” Kujawa concludes.

While it’s always a good idea to avoid places that can have a high prevalence of malware, including pirate sites, the claims in the video are not backed up by real evidence. There are tens of thousands of non-pirate sites that pose similar or worse risks, so it’s always a good idea to have anti-malware and virus software installed.

The organizations and people involved in the British “Meet the Malwares” video might not have been aware of the doubtful claims, but it’s unfortunate that they didn’t opt for a broader campaign instead of the focused anti-piracy message.

Finally, since it’s still Safer Internet Day, we encourage kids to take a close look at the various guides on how to avoid “fake news” while engaging in critical thinking.

I know from first hand experience the FBI is corrupt. In 2007, they threatened me, trying to get me to cancel a talk that revealed security vulnerabilities in a large corporation’s product. Such abuses occur because there is no transparency and oversight. FBI agents write down our conversation in their little notebooks instead of recording it, so that they can control the narrative of what happened, presenting their version of the converstion (leaving out the threats). In this day and age of recording devices, this is indefensible.

She writes “I know firsthand that it’s difficult to get a FISA warrant“. Yes, the process was difficult for her, an underling, to get a FISA warrant. The process is different when a leader tries to do the same thing.

I know this first hand having casually worked as an outsider with intelligence agencies. I saw two processes in place: one for the flunkies, and one for those above the system. The flunkies constantly complained about how there is too many process in place oppressing them, preventing them from getting their jobs done. The leaders understood the system and how to sidestep those processes.

That’s not to say the Nunes Memo has merit, but it does point out that privacy advocates have a point in wanting more oversight and transparency in such surveillance of American citizens.

Blaming us privacy advocates isn’t the way to go. It’s not going to succeed in tarnishing us, but will push us more into Trump’s camp, causing us to reiterate that we believe the FBI and FISA are corrupt.

For over a decade, civil libertarians have been fighting government mass surveillance of innocent Americans over the Internet. We’ve just lost an important battle. On January 18, President Trumpsigned the renewal of Section 702, domestic mass surveillance became effectively a permanent part of US law.

Section 702 was initially passed in 2008, as an amendment to the Foreign Intelligence Surveillance Act of 1978. As the title of that law says, it was billed as a way for the NSA to spy on non-Americans located outside the United States. It was supposed to be an efficiency and cost-saving measure: the NSA was already permitted to tap communications cables located outside the country, and it was already permitted to tap communications cables from one foreign country to another that passed through the United States. Section 702 allowed it to tap those cables from inside the United States, where it was easier. It also allowed the NSA to request surveillance data directly from Internet companies under a program called PRISM.

The problem is that this authority also gave the NSA the ability to collect foreign communications and data in a way that inherently and intentionally also swept up Americans’ communications as well, without a warrant. Other law enforcement agencies are allowed to ask the NSA to search those communications, give their contents to the FBI and other agencies and then lie about their origins in court.

In 1978, after Watergate had revealed the Nixon administration’s abuses of power, we erected a wall between intelligence and law enforcement that prevented precisely this kind of sharing of surveillance data under any authority less restrictive than the Fourth Amendment. Weakening that wall is incredibly dangerous, and the NSA should never have been given this authority in the first place.

Arguably, it never was. The NSA had been doing this type of surveillance illegally for years, something that was first made public in 2006. Section 702 was secretly used as a way to paper over that illegal collection, but nothing in the text of the later amendment gives the NSA this authority. We didn’t know that the NSA was using this law as the statutory basis for this surveillance until Edward Snowden showed us in 2013.

Civil libertarians have been battling this law in both Congress and the courts ever since it was proposed, and the NSA’s domestic surveillance activities even longer. What this most recent vote tells me is that we’ve lost that fight.

Section 702 was passed under George W. Bush in 2008, reauthorized under Barack Obama in 2012, and now reauthorized again under Trump. In all three cases, congressional support was bipartisan. It has survived multiple lawsuits by the Electronic Frontier Foundation, the ACLU, and others. It has survived the revelations by Snowden that it was being used far more extensively than Congress or the public believed, and numerous public reports of violations of the law. It has even survived Trump’s belief that he was being personally spied on by the intelligence community, as well as any congressional fears that Trump could abuse the authority in the coming years. And though this extension lasts only six years, it’s inconceivable to me that it will ever be repealed at this point.

So what do we do? If we can’t fight this particular statutory authority, where’s the new front on surveillance? There are, it turns out, reasonable modifications that target surveillance more generally, and not in terms of any particular statutory authority. We need to look at US surveillance law more generally.

First, we need to strengthen the minimization procedures to limit incidental collection. Since the Internet was developed, all the world’s communications travel around in a single global network. It’s impossible to collect only foreign communications, because they’re invariably mixed in with domestic communications. This is called “incidental” collection, but that’s a misleading name. It’s collected knowingly, and searched regularly. The intelligence community needs much stronger restrictions on which American communications channels it can access without a court order, and rules that require they delete the data if they inadvertently collect it. More importantly, “collection” is defined as the point the NSA takes a copy of the communications, and not later when they search their databases.

Second, we need to limit how other law enforcement agencies can use incidentally collected information. Today, those agencies can query a database of incidental collection on Americans. The NSA can legally pass information to those other agencies. This has to stop. Data collected by the NSA under its foreign surveillance authority should not be used as a vehicle for domestic surveillance.

The most recent reauthorization modified this lightly, forcing the FBI to obtain a court order when querying the 702 data for a criminal investigation. There are still exceptions and loopholes, though.

Third, we need to end what’s called “parallel construction.” Today, when a law enforcement agency uses evidence found in this NSA database to arrest someone, it doesn’t have to disclose that fact in court. It can reconstruct the evidence in some other manner once it knows about it, and then pretend it learned of it that way. This right to lie to the judge and the defense is corrosive to liberty, and it must end.

Pressure to reform the NSA will probably first come from Europe. Already, European Union courts have pointed to warrantless NSA surveillance as a reason to keep Europeans’ data out of US hands. Right now, there is a fragile agreement between the EU and the United States ­– called “Privacy Shield” — ­that requires Americans to maintain certain safeguards for international data flows. NSA surveillance goes against that, and it’s only a matter of time before EU courts start ruling this way. That’ll have significant effects on both government and corporate surveillance of Europeans and, by extension, the entire world.

Further pressure will come from the increased surveillance coming from the Internet of Things. When your home, car, and body are awash in sensors, privacy from both governments and corporations will become increasingly important. Sooner or later, society will reach a tipping point where it’s all too much. When that happens, we’re going to see significant pushback against surveillance of all kinds. That’s when we’ll get new laws that revise all government authorities in this area: a clean sweep for a new world, one with new norms and new fears.

It’s possible that a federal court will rule on Section 702. Although there have been many lawsuits challenging the legality of what the NSA is doing and the constitutionality of the 702 program, no court has ever ruled on those questions. The Bush and Obama administrations successfully argued that defendants don’t have legal standing to sue. That is, they have no right to sue because they don’t know they’re being targeted. If any of the lawsuits can get past that, things might change dramatically.

Meanwhile, much of this is the responsibility of the tech sector. This problem exists primarily because Internet companies collect and retain so much personal data and allow it to be sent across the network with minimal security. Since the government has abdicated its responsibility to protect our privacy and security, these companies need to step up: Minimize data collection. Don’t save data longer than absolutely necessary. Encrypt what has to be saved. Well-designed Internet services will safeguard users, regardless of government surveillance authority.

For the rest of us concerned about this, it’s important not to give up hope. Everything we do to keep the issue in the public eye ­– and not just when the authority comes up for reauthorization again in 2024 — hastens the day when we will reaffirm our rights to privacy in the digital age.

It was an accident

The most important fact about Wannacry is that it was an accident. We’ve had 30 years of experience with Internet worms teaching us that worms are always accidents. While launching worms may be intentional, their effects cannot be predicted. While they appear to have targets, like Slammer against South Korea, or Witty against the Pentagon, further analysis shows this was just a random effect that was impossible to predict ahead of time. Only in hindsight are these effects explainable.

We should hold those causing accidents accountable, too, but it’s a different accountability. The U.S. has caused more civilian deaths in its War on Terror than the terrorists caused triggering that war. But we hold these to be morally different: the terrorists targeted the innocent, whereas the U.S. takes great pains to avoid civilian casualties.

Since we are talking about blaming those responsible for accidents, we also must include the NSA in that mix. The NSA created, then allowed the release of, weaponized exploits. That’s like accidentally dropping a load of unexploded bombs near a village. When those bombs are then used, those having lost the weapons are held guilty along with those using them. Yes, while we should blame the hacker who added ETERNAL BLUE to their ransomware, we should also blame the NSA for losing control of ETERNAL BLUE.

A country and its assets are different

Was it North Korea, or hackers affilliated with North Korea? These aren’t the same.

It’s hard for North Korea to have hackers of its own. It doesn’t have citizens who grow up with computers to pick from. Moreover, an internal hacking corps would create tainted citizens exposed to dangerous outside ideas. Update: Some people have pointed out that Kim Il-sung University in the capital does have some contact with the outside world, with academics granted limited Internet access, so I guess some tainting is allowed. Still, what we know of North Korea hacking efforts largley comes from hackers they employ outside North Korea. It was the Lazurus Group, outside North Korea, that did Wannacry.

Instead, North Korea develops external hacking “assets”, supporting several external hacking groups in China, Japan, and South Korea. This is similar to how intelligence agencies develop human “assets” in foreign countries. While these assets do things for their handlers, they also have normal day jobs, and do many things that are wholly independent and even sometimes against their handler’s interests.

For example, this Muckrock FOIA dump shows how “CIA assets” independently worked for Castro and assassinated a Panamanian president. That they also worked for the CIA does not make the CIA responsible for the Panamanian assassination.

That CIA/intelligence assets work this way is well-known and uncontroversial. The fact that countries use hacker assets like this is the controversial part. These hackers do act independently, yet we refuse to consider this when we want to “attribute” attacks.

Attribution is political

We have far better attribution for the nPetya attacks. It was less accidental (they clearly desired to disrupt Ukraine), and the hackers were much closer to the Russian government (Russian citizens). Yet, the Trump administration isn’t fighting Russia, they are fighting North Korea, so they don’t officially attribute nPetya to Russia, but do attribute Wannacry to North Korea.

Trump is in conflict with North Korea. He is looking for ways to escalate the conflict. Attributing Wannacry helps achieve his political objectives.

That it was blatantly politics is demonstrated by the way it was released to the press. It wasn’t released in the normal way, where the administration can stand behind it, and get challenged on the particulars. Instead, it was pre-released through the normal system of “anonymous government officials” to the NYTimes, and then backed up with op-ed in the Wall Street Journal. The government leaks information like this when it’s weak, not when its strong.

The proper way is to release the evidence upon which the decision was made, so that the public can challenge it. Among the questions the public would ask is whether it they believe it was North Korea’s intention to cause precisely this effect, such as disabling the British NHS. Or, whether it was merely hackers “affiliated” with North Korea, or hackers carrying out North Korea’s orders. We cannot challenge the government this way because the government intentionally holds itself above such accountability.

Conclusion

We believe hacking groups tied to North Korea are responsible for Wannacry. Yet, even if that’s true, we still have three attribution problems. We still don’t know if that was intentional, in pursuit of some political goal, or an accident. We still don’t know if it was at the direction of North Korea, or whether their hacker assets acted independently. We still don’t know if the government has answers to these questions, or whether it’s exploiting this doubt to achieve political support for actions against North Korea.

On January 3, the world learned about a series of major security vulnerabilities in modern microprocessors. Called Spectre and Meltdown, these vulnerabilities were discovered by several different researchers last summer, disclosed to the microprocessors’ manufacturers, and patched­ — at least to the extent possible.

This news isn’t really any different from the usual endless stream of security vulnerabilities and patches, but it’s also a harbinger of the sorts of security problems we’re going to be seeing in the coming years. These are vulnerabilities in computer hardware, not software. They affect virtually all high-end microprocessors produced in the last 20 years. Patching them requires large-scale coordination across the industry, and in some cases drastically affects the performance of the computers. And sometimes patching isn’t possible; the vulnerability will remain until the computer is discarded.

Spectre and Meltdown aren’t anomalies. They represent a new area to look for vulnerabilities and a new avenue of attack. They’re the future of security­ — and it doesn’t look good for the defenders.

Modern computers do lots of things at the same time. Your computer and your phone simultaneously run several applications — ­or apps. Your browser has several windows open. A cloud computer runs applications for many different computers. All of those applications need to be isolated from each other. For security, one application isn’t supposed to be able to peek at what another one is doing, except in very controlled circumstances. Otherwise, a malicious advertisement on a website you’re visiting could eavesdrop on your banking details, or the cloud service purchased by some foreign intelligence organization could eavesdrop on every other cloud customer, and so on. The companies that write browsers, operating systems, and cloud infrastructure spend a lot of time making sure this isolation works.

Both Spectre and Meltdown break that isolation, deep down at the microprocessor level, by exploiting performance optimizations that have been implemented for the past decade or so. Basically, microprocessors have become so fast that they spend a lot of time waiting for data to move in and out of memory. To increase performance, these processors guess what data they’re going to receive and execute instructions based on that. If the guess turns out to be correct, it’s a performance win. If it’s wrong, the microprocessors throw away what they’ve done without losing any time. This feature is called speculative execution.

Spectre and Meltdown attack speculative execution in different ways. Meltdown is more of a conventional vulnerability; the designers of the speculative-execution process made a mistake, so they just needed to fix it. Spectre is worse; it’s a flaw in the very concept of speculative execution. There’s no way to patch that vulnerability; the chips need to be redesigned in such a way as to eliminate it.

Since the announcement, manufacturers have been rolling out patches to these vulnerabilities to the extent possible. Operating systems have been patched so that attackers can’t make use of the vulnerabilities. Web browsers have been patched. Chips have been patched. From the user’s perspective, these are routine fixes. But several aspects of these vulnerabilities illustrate the sorts of security problems we’re only going to be seeing more of.

First, attacks against hardware, as opposed to software, will become more common. Last fall, vulnerabilities were discovered in Intel’s Management Engine, a remote-administration feature on its microprocessors. Like Spectre and Meltdown, they affected how the chips operate. Looking for vulnerabilities on computer chips is new. Now that researchers know this is a fruitful area to explore, security researchers, foreign intelligence agencies, and criminals will be on the hunt.

Second, because microprocessors are fundamental parts of computers, patching requires coordination between many companies. Even when manufacturers like Intel and AMD can write a patch for a vulnerability, computer makers and application vendors still have to customize and push the patch out to the users. This makes it much harder to keep vulnerabilities secret while patches are being written. Spectre and Meltdown were announced prematurely because details were leaking and rumors were swirling. Situations like this give malicious actors more opportunity to attack systems before they’re guarded.

Third, these vulnerabilities will affect computers’ functionality. In some cases, the patches for Spectre and Meltdown result in significant reductions in speed. The press initially reported 30%, but that only seems true for certain servers running in the cloud. For your personal computer or phone, the performance hit from the patch is minimal. But as more vulnerabilities are discovered in hardware, patches will affect performance in noticeable ways.

And then there are the unpatchable vulnerabilities. For decades, the computer industry has kept things secure by finding vulnerabilities in fielded products and quickly patching them. Now there are cases where that doesn’t work. Sometimes it’s because computers are in cheap products that don’t have a patch mechanism, like many of the DVRs and webcams that are vulnerable to the Mirai (and other) botnets — ­groups of Internet-connected devices sabotaged for coordinated digital attacks. Sometimes it’s because a computer chip’s functionality is so core to a computer’s design that patching it effectively means turning the computer off. This, too, is becoming more common.

Increasingly, everything is a computer: not just your laptop and phone, but your car, your appliances, your medical devices, and global infrastructure. These computers are and always will be vulnerable, but Spectre and Meltdown represent a new class of vulnerability. Unpatchable vulnerabilities in the deepest recesses of the world’s computer hardware is the new normal. It’s going to leave us all much more vulnerable in the future.

This week, police forces around Europe took action against what is believed to be one of the world’s largest pirate IPTV networks.

The investigation, launched a year ago and coordinated by Europol, came to head on Tuesday when police carried out raids in Cyprus, Bulgaria, Greece, and the Netherlands. A fresh announcement from the crime-fighting group reveals the scale of the operation.

It was led by the Cypriot Police – Intellectual Property Crime Unit, with the support of the Cybercrime Division of the Greek Police, the Dutch Fiscal Investigative and Intelligence Service (FIOD), the Cybercrime Unit of the Bulgarian Police, Europol’s Intellectual Property Crime Coordinated Coalition (IPC³), and supported by members of the Audiovisual Anti-Piracy Alliance (AAPA).

In Cyprus, Bulgaria and Greece, 17 house searches were carried out. Three individuals aged 43, 44, and 53 were arrested in Cyprus and one was arrested in Bulgaria.

All stand accused of being involved in an international operation to illegally broadcast around 1,200 channels of pirated content to an estimated 500,000 subscribers. Some of the channels offered were illegally sourced from Sky UK, Bein Sports, Sky Italia, and Sky DE. On Thursday, the three individuals in Cyprus were remanded in custody for seven days.

“The servers used to distribute the channels were shut down, and IP addresses hosted by a Dutch company were also deactivated thanks to the cooperation of the authorities of The Netherlands,” Europol reports.

TorrentFreak was previously able to establish that Megabyte-Internet Ltd, an ISP located in the small Bulgarian town Petrich, was targeted by police. The provider went down on Tuesday but returned towards the end of the week. Responding to our earlier inquiries, the company told us more about the situation.

“We are an ISP provider located in Petrich, Bulgaria. We are selling services to around 1,500 end-clients in the Petrich area and surrounding villages,” a spokesperson explained.

“Another part of our business is internet services like dedicated unmanaged servers, hosting, email servers, storage services, and VPNs etc.”

The spokesperson added that some of Megabyte’s equipment is located at Telepoint, Bulgaria’s biggest datacenter, with connectivity to Petrich. During the raid the police seized the company’s hardware to check for evidence of illegal activity.

“We were informed by the police that some of our clients in Petrich and Sofia were using our service for illegal streaming and actions,” the company said.

“Of course, we were not able to know this because our services are unmanaged and root access [to servers] is given to our clients. For this reason any client and anyone that uses our services are responsible for their own actions.”

TorrentFreak asked many more questions, including how many police attended, what type and volume of hardware was seized, and whether anyone was arrested or taken for questioning. But, apart from noting that the police were friendly, the company declined to give us any additional information, revealing that it was not permitted to do so at this stage.

What is clear, however, is that Megabyte-Internet is offering its full cooperation to the authorities. The company says that it cannot be held responsible for the actions of its clients so their details will be handed over as part of the investigation.

“So now we will give to the police any details about these clients because we hold their full details by law. [The police] will find [out about] all the illegal actions from them,” the company concludes, adding that it’s fully operational once more and working with clients.

Last week I attended a talk given by Bryan Mistele, president of Seattle-based INRIX. Bryan’s talk provided a glimpse into the future of transportation, centering around four principle attributes, often abbreviated as ACES:

Autonomous – Cars and trucks are gaining the ability to scan and to make sense of their environments and to navigate without human input.

Connected – Vehicles of all types have the ability to take advantage of bidirectional connections (either full-time or intermittent) to other cars and to cloud-based resources. They can upload road and performance data, communicate with each other to run in packs, and take advantage of traffic and weather data.

Electric – Continued development of battery and motor technology, will make electrics vehicles more convenient, cost-effective, and environmentally friendly.

Individually and in combination, these emerging attributes mean that the cars and trucks we will see and use in the decade to come will be markedly different than those of the past.

On the Road with AWS AWS customers are already using our AWS IoT, edge computing, Amazon Machine Learning, and Alexa products to bring this future to life – vehicle manufacturers, their tier 1 suppliers, and AutoTech startups all use AWS for their ACES initiatives. AWS Greengrass is playing an important role here, attracting design wins and helping our customers to add processing power and machine learning inferencing at the edge.

AWS customer Aptiv (formerly Delphi) talked about their Automated Mobility on Demand (AMoD) smart vehicle architecture in a AWS re:Invent session. Aptiv’s AMoD platform will use Greengrass and microservices to drive the onboard user experience, along with edge processing, monitoring, and control. Here’s an overview:

Another customer, Denso of Japan (one of the world’s largest suppliers of auto components and software) is using Greengrass and AWS IoT to support their vision of Mobility as a Service (MaaS). Here’s a video:

AWS at CES The AWS team will be out in force at CES in Las Vegas and would love to talk to you. They’ll be running demos that show how AWS can help to bring innovation and personalization to connected and autonomous vehicles.

Personalized In-Vehicle Experience – This demo shows how AWS AI and Machine Learning can be used to create a highly personalized and branded in-vehicle experience. It makes use of Amazon Lex, Polly, and Amazon Rekognition, but the design is flexible and can be used with other services as well. The demo encompasses driver registration, login and startup (including facial recognition), voice assistance for contextual guidance, personalized e-commerce, and vehicle control. Here’s the architecture for the voice assistance:

Connected Vehicle Solution – This demo shows how a connected vehicle can combine local and cloud intelligence, using edge computing and machine learning at the edge. It handles intermittent connections and uses AWS DeepLens to train a model that responds to distracted drivers. Here’s the overall architecture, as described in our Connected Vehicle Solution:

Digital Content Delivery – This demo will show how a customer uses a web-based 3D configurator to build and personalize their vehicle. It will also show high resolution (4K) 3D image and an optional immersive AR/VR experience, both designed for use within a dealership.

Autonomous Driving – This demo will showcase the AWS services that can be used to build autonomous vehicles. There’s a 1/16th scale model vehicle powered and driven by Greengrass and an overview of a new AWS Autonomous Toolkit. As part of the demo, attendees drive the car, training a model via Amazon SageMaker for subsequent on-board inferencing, powered by Greengrass ML Inferencing.

To speak to one of my colleagues or to set up a time to see the demos, check out the Visit AWS at CES 2018 page.

Some Resources If you are interested in this topic and want to learn more, the AWS for Automotive page is a great starting point, with discussions on connected vehicles & mobility, autonomous vehicle development, and digital customer engagement.

When you are ready to start building a connected vehicle, the AWS Connected Vehicle Solution contains a reference architecture that combines local computing, sophisticated event rules, and cloud-based data processing and storage. You can use this solution to accelerate your own connected vehicle projects.

Noted below are the upcoming scheduled live, online technical sessions being held during the month of January. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts.

With Amazon Redshift, you can build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Amazon Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat.

In this post, we describe how to combine data in Aurora in Amazon Redshift. Here’s an overview of the solution:

Use AWS Lambda functions with Amazon Aurora to capture data changes in a table.

Consider a scenario in which an e-commerce web application uses Amazon Aurora for a transactional database layer. The company has a sales table that captures every single sale, along with a few corresponding data items. This information is stored as immutable data in a table. Business users want to monitor the sales data and then analyze and visualize it.

In this example, you take the changes in data in an Aurora database table and save it in Amazon S3. After the data is captured in Amazon S3, you combine it with data in your existing Amazon Redshift cluster for analysis.

By the end of this post, you will understand how to capture data events in an Aurora table and push them out to other AWS services using AWS Lambda.

The following diagram shows the flow of data as it occurs in this tutorial:

The starting point in this architecture is a database insert operation in Amazon Aurora. When the insert statement is executed, a custom trigger calls a Lambda function and forwards the inserted data. Lambda writes the data that it received from Amazon Aurora to a Kinesis data delivery stream. Kinesis Data Firehose writes the data to an Amazon S3 bucket. Once the data is in an Amazon S3 bucket, it is queried in place using Amazon Redshift Spectrum.

Creating an Aurora database

First, create a database by following these steps in the Amazon RDS console:

Sign in to the AWS Management Console, and open the Amazon RDS console.

Choose Launch a DB instance, and choose Next.

For Engine, choose Amazon Aurora.

Choose a DB instance class. This example uses a small, since this is not a production database.

After you create the database, use MySQL Workbench to connect to the database using the CNAME from the console. For information about connecting to an Aurora database, see Connecting to an Amazon Aurora DB Cluster.

The following screenshot shows the MySQL Workbench configuration:

Next, create a table in the database by running the following SQL statement:

You can now populate the table with some sample data. To generate sample data in your table, copy and run the following script. Ensure that the highlighted (bold) variables are replaced with appropriate values.

The following screenshot shows how the table appears with the sample data:

Sending data from Amazon Aurora to Amazon S3

There are two methods available to send data from Amazon Aurora to Amazon S3:

Using a Lambda function

Using SELECT INTO OUTFILE S3

To demonstrate the ease of setting up integration between multiple AWS services, we use a Lambda function to send data to Amazon S3 using Amazon Kinesis Data Firehose.

Alternatively, you can use a SELECT INTO OUTFILE S3 statement to query data from an Amazon Aurora DB cluster and save it directly in text files that are stored in an Amazon S3 bucket. However, with this method, there is a delay between the time that the database transaction occurs and the time that the data is exported to Amazon S3 because the default file size threshold is 6 GB.

Creating a Kinesis data delivery stream

The next step is to create a Kinesis data delivery stream, since it’s a dependency of the Lambda function.

To create a delivery stream:

Open the Kinesis Data Firehose console

Choose Create delivery stream.

For Delivery stream name, type AuroraChangesToS3.

For Source, choose Direct PUT.

For Record transformation, choose Disabled.

For Destination, choose Amazon S3.

In the S3 bucket drop-down list, choose an existing bucket, or create a new one.

Enter a prefix if needed, and choose Next.

For Data compression, choose GZIP.

In IAM role, choose either an existing role that has access to write to Amazon S3, or choose to generate one automatically. Choose Next.

Review all the details on the screen, and choose Create delivery stream when you’re finished.

Creating a Lambda function

Now you can create a Lambda function that is called every time there is a change that needs to be tracked in the database table. This Lambda function passes the data to the Kinesis data delivery stream that you created earlier.

To create the Lambda function:

Open the AWS Lambda console.

Ensure that you are in the AWS Region where your Amazon Aurora database is located.

If you have no Lambda functions yet, choose Get started now. Otherwise, choose Create function.

Choose Author from scratch.

Give your function a name and select Python 3.6 for Runtime

Choose and existing or create a new Role, the role would need to have access to call firehose:PutRecord

Choose Next on the trigger selection screen.

Paste the following code in the code window. Change the stream_name variable to the Kinesis data delivery stream that you created in the previous step.

Once you are finished, the Amazon Aurora database has access to invoke a Lambda function.

Creating a stored procedure and a trigger in Amazon Aurora

Now, go back to MySQL Workbench, and run the following command to create a new stored procedure. When this stored procedure is called, it invokes the Lambda function you created. Change the ARN in the following code to your Lambda function’s ARN.

If a new row is inserted in the Sales table, the Lambda function that is mentioned in the stored procedure is invoked.

Verify that data is being sent from the Lambda function to Kinesis Data Firehose to Amazon S3 successfully. You might have to insert a few records, depending on the size of your data, before new records appear in Amazon S3. This is due to Kinesis Data Firehose buffering. To learn more about Kinesis Data Firehose buffering, see the “Amazon S3” section in Amazon Kinesis Data Firehose Data Delivery.

Every time a new record is inserted in the sales table, a stored procedure is called, and it updates data in Amazon S3.

Querying data in Amazon Redshift

In this section, you use the data you produced from Amazon Aurora and consume it as-is in Amazon Redshift. In order to allow you to process your data as-is, where it is, while taking advantage of the power and flexibility of Amazon Redshift, you use Amazon Redshift Spectrum. You can use Redshift Spectrum to run complex queries on data stored in Amazon S3, with no need for loading or other data prep.

Just create a data source and issue your queries to your Amazon Redshift cluster as usual. Behind the scenes, Redshift Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your dataset grows to beyond an exabyte! Being able to query data that is stored in Amazon S3 means that you can scale your compute and your storage independently. You have the full power of the Amazon Redshift query model and all the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Amazon Redshift tables and in Amazon S3.

Querying data in local and external tables using Amazon Redshift

Now that you have the fact and dimension table populated with data, you can combine the two and run analysis. For example, if you want to query the total sales amount by weekday, you can run the following:

select sum(quantity*price) as total_sales, date_dimension.d_season
from spectrum_schema.ecommerce_sales
join date_dimension on spectrum_schema.ecommerce_sales.orderdate = date_dimension.d_prettydate
group by date_dimension.d_season

You get the following results:

Similarly, you can replace d_season with d_dayofweek to get sales figures by weekday:

With Amazon Redshift Spectrum, you pay only for the queries you run against the data that you actually scan. We encourage you to use file partitioning, columnar data formats, and data compression to significantly minimize the amount of data scanned in Amazon S3. This is important for data warehousing because it dramatically improves query performance and reduces cost.

Partitioning your data in Amazon S3 by date, time, or any other custom keys enables Amazon Redshift Spectrum to dynamically prune nonrelevant partitions to minimize the amount of data processed. If you store data in a columnar format, such as Parquet, Amazon Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows. Similarly, if you compress your data using one of the supported compression algorithms in Amazon Redshift Spectrum, less data is scanned.

Analyzing and visualizing Amazon Redshift data in Amazon QuickSight

After modifying the Amazon Redshift security group, go to Amazon QuickSight. Create a new analysis, and choose Amazon Redshift as the data source.

Enter the database connection details, validate the connection, and create the data source.

Choose the schema to be analyzed. In this case, choose spectrum_schema, and then choose the ecommerce_sales table.

Next, we add a custom field for Total Sales = Price*Quantity. In the drop-down list for the ecommerce_sales table, choose Edit analysis data sets.

On the next screen, choose Edit.

In the data prep screen, choose New Field. Add a new calculated field Total Sales $, which is the product of the Price*Quantity fields. Then choose Create. Save and visualize it.

Next, to visualize total sales figures by month, create a graph with Total Sales on the x-axis and Order Data formatted as month on the y-axis.

After you’ve finished, you can use Amazon QuickSight to add different columns from your Amazon Redshift tables and perform different types of visualizations. You can build operational dashboards that continuously monitor your transactional and analytical data. You can publish these dashboards and share them with others.

Final notes

Amazon QuickSight can also read data in Amazon S3 directly. However, with the method demonstrated in this post, you have the option to manipulate, filter, and combine data from multiple sources or Amazon Redshift tables before visualizing it in Amazon QuickSight.

In this example, we dealt with data being inserted, but triggers can be activated in response to an INSERT, UPDATE, or DELETE trigger.

Keep the following in mind:

Be careful when invoking a Lambda function from triggers on tables that experience high write traffic. This would result in a large number of calls to your Lambda function. Although calls to the lambda_async procedure are asynchronous, triggers are synchronous.

A statement that results in a large number of trigger activations does not wait for the call to the AWS Lambda function to complete. But it does wait for the triggers to complete before returning control to the client.

Similarly, you must account for Amazon Kinesis Data Firehose limits. By default, Kinesis Data Firehose is limited to a maximum of 5,000 records/second. For more information, see Monitoring Amazon Kinesis Data Firehose.

In certain cases, it may be optimal to use AWS Database Migration Service (AWS DMS) to capture data changes in Aurora and use Amazon S3 as a target. For example, AWS DMS might be a good option if you don’t need to transform data from Amazon Aurora. The method used in this post gives you the flexibility to transform data from Aurora using Lambda before sending it to Amazon S3. Additionally, the architecture has the benefits of being serverless, whereas AWS DMS requires an Amazon EC2 instance for replication.

Additional Reading

About the Authors

Re Alvarez-Parmar is a solutions architect for Amazon Web Services. He helps enterprises achieve success through technical guidance and thought leadership. In his spare time, he enjoys spending time with his two kids and exploring outdoors.

On an average day, Google processes more than three million takedown notices from copyright holders, and that’s for its search engine alone.

Under the current DMCA legislation, US-based Internet service providers are expected to remove infringing links, if a copyright holder complains.

This process shields these services from direct liability. In recent years there has been a lot of discussion about the effectiveness of the system, but Google has always maintained that it works well.

This was also highlighted by Google’s copyright counsel Caleb Donaldson, in an article he wrote for the American Bar Association’s publication Landslide.

“The DMCA provided Google and other online service providers the legal certainty they needed to grow,” Donaldson writes.

“And the DMCA’s takedown notices help us fight piracy in other ways as well. Indeed, the Web Search notice-and-takedown process provides the cornerstone of Google’s fight against piracy.”

The search engine does indeed go beyond ‘just’ removing links. The takedown notices are also used as a signal to demote domains. Websites for which it receives a lot of takedown notices will be placed lower in search results, for example.

These measures can be expanded and complemented by artificial intelligence in the future, Google’s copyright counsel envisions.

“As we move into a world where artificial intelligence can learn from vast troves of data like these, we will only get better at using the information to better fight against piracy,” Donaldson writes.

Artificial intelligence (AI) is a buzz-term that has a pretty broad meaning nowadays. Donaldson doesn’t go into detail on how AI can fight piracy. It could help to spot erroneous notices, on the one hand, but can also be applied to filter content proactively.

The latter is something Google is slowly opening up to.

Over the past year, we’ve noticed on a few occasions that Google is processing takedown notices for non-indexed links. While we assumed that this was an ‘error’ on the sender’s part, it appears to be a new policy.

“Google has critically expanded notice and takedown in another important way: We accept notices for URLs that are not even in our index in the first place. That way, we can collect information even about pages and domains we have not yet crawled,” Donaldson writes.

In other words, Google blocks URLs before they appear in the search results, as some sort of piracy vaccine.

“We process these URLs as we do the others. Once one of these not-in-index URLs is approved for takedown, we prophylactically block it from appearing in our Search results, and we take all the additional deterrent measures listed above.”

Some submitters are heavily relying on the new feature, Google found. In some cases, the majority of the submitted URLs in a notice are not indexed yet.

The search engine will keep a close eye on these developments. At TorrentFreak, we also found that copyright holders sometimes target links that don’t even exist. Whether Google will also accept these takedown requests in the future, is unknown.

It’s clear that artificial intelligence and proactive filtering are becoming more and more common, but Google says that the company will also keep an eye on possible abuse of the system.

“Google will push back if we suspect a notice is mistaken, fraudulent, or abusive, or if we think fair use or another defense excuses that particular use of copyrighted content,” Donaldson notes.

Artificial intelligence and prophylactic blocking surely add a new dimension to the standard DMCA takedown procedure, but whether it will be enough to convince copyright holders that it works, has yet to be seen.

In this episode, we meet Sam Woolley, director of the Digital Intelligence Lab at the Institute for the Future, to dig deeper into the topic of troll farms, political disinformation and the use of social media bots to create what Sam calls ‘Computational Propaganda’.

What happens when the ability to create propaganda is democratized out of the hands of governments and corporate media and into the hands of unknown, weird and downright dangerous online actors?

—

Steal This Show aims to release bi-weekly episodes featuring insiders discussing copyright and file-sharing news. It complements our regular reporting by adding more room for opinion, commentary, and analysis.

The guests for our news discussions will vary, and we’ll aim to introduce voices from different backgrounds and persuasions. In addition to news, STS will also produce features interviewing some of the great innovators and minds.

The video is a demo of my “Not Santa” detector that I deployed to the Raspberry Pi. I trained the detector using deep learning, Keras, and Python. You can find the full source code and tutorial here: https://www.pyimagesearch.com/2017/12/18/keras-deep-learning-raspberry-pi/

Ho-ho-how does it work?

Note: Adrian Rosebrock is not Santa. But he does a good enough impression of the jolly old fellow that his disguise can fool a Raspberry Pi into thinking otherwise.

We jest, but has anyone seen Adrian and Santa in the same room together?Image c/o Adrian Rosebrock

But how is the Raspberry Pi able to detect the Santa-ness or Not-Santa-ness of people who walk into the frame?

Two words: deep learning

If you’re not sure what deep learning is, you’re not alone. It’s a hefty topic, and one that Adrian has written a book about, so I grilled him for a bluffers’ guide. In his words, deep learning is:

…a subfield of machine learning, which is, in turn a subfield of artificial intelligence (AI). While AI embodies a large, diverse set of techniques and algorithms related to automatic reasoning (inference, planning, heuristics, etc), the machine learning subfields are specifically interested in pattern recognition and learning from data.

Artificial Neural Networks (ANNs) are a class of machine learning algorithms that can learn from data. We have been using ANNs successfully for over 60 years, but something special happened in the past 5 years — (1) we’ve been able to accumulate massive datasets, orders of magnitude larger than previous datasets, and (2) we have access to specialized hardware to train networks faster (i.e., GPUs).

Given these large datasets and specialized hardware, deeper neural networks can be trained, leading to the term “deep learning”.

So now we have a bird’s-eye view of deep learning, how does the detector detect?

The camera captures footage of Santa in the wild, while the Christmas tree add-on provides a twinkly notification, accompanied by a resonant ho, ho, ho from the speakers.

A deeper deep dive into deep learning

A full breakdown of the project and the workings of the Not Santa detector can be found on Adrian’s blog, PyImageSearch, which includes links to other deep learning and image classification tutorials using TensorFlow and Keras. It’s an excellent place to start if you’d like to understand more about deep learning.

Build your own Santa detector

Santa might catch on to Adrian’s clever detector and start avoiding the camera, and for that eventuality, we have our own Santa detector. It uses motion detection to notify you of his presence (and your presents!).

Check out our Santa Detector resource here and use a passive infrared sensor, Raspberry Pi, and Scratch to catch the big man in action.

In late September, during the annual Splunk .conf, Splunk and Amazon Web Services (AWS) jointly announced that Amazon Kinesis Data Firehose now supports Splunk Enterprise and Splunk Cloud as a delivery destination. This native integration between Splunk Enterprise, Splunk Cloud, and Amazon Kinesis Data Firehose is designed to make AWS data ingestion setup seamless, while offering a secure and fault-tolerant delivery mechanism. We want to enable customers to monitor and analyze machine data from any source and use it to deliver operational intelligence and optimize IT, security, and business performance.

With Kinesis Data Firehose, customers can use a fully managed, reliable, and scalable data streaming solution to Splunk. In this post, we tell you a bit more about the Kinesis Data Firehose and Splunk integration. We also show you how to ingest large amounts of data into Splunk using Kinesis Data Firehose.

Push vs. Pull data ingestion

Presently, customers use a combination of two ingestion patterns, primarily based on data source and volume, in addition to existing company infrastructure and expertise:

The pull-based approach offers data delivery guarantees such as retries and checkpointing out of the box. However, it requires more ops to manage and orchestrate the dedicated pollers, which are commonly running on Amazon EC2 instances. With this setup, you pay for the infrastructure even when it’s idle.

On the other hand, the push-based approach offers a low-latency scalable data pipeline made up of serverless resources like AWS Lambda sending directly to Splunk indexers (by using Splunk HEC). This approach translates into lower operational complexity and cost. However, if you need guaranteed data delivery then you have to design your solution to handle issues such as a Splunk connection failure or Lambda execution failure. To do so, you might use, for example, AWS Lambda Dead Letter Queues.

How about getting the best of both worlds?

Let’s go over the new integration’s end-to-end solution and examine how Kinesis Data Firehose and Splunk together expand the push-based approach into a native AWS solution for applicable data sources.

By using a managed service like Kinesis Data Firehose for data ingestion into Splunk, we provide out-of-the-box reliability and scalability. One of the pain points of the old approach was the overhead of managing the data collection nodes (Splunk heavy forwarders). With the new Kinesis Data Firehose to Splunk integration, there are no forwarders to manage or set up. Data producers (1) are configured through the AWS Management Console to drop data into Kinesis Data Firehose.

You might need to transform the data before it goes into Splunk for analysis. For example, you might want to enrich it or filter or anonymize sensitive data. You can do so using AWS Lambda. In this scenario, Kinesis Data Firehose buffers data from the incoming source data, sends it to the specified Lambda function (2), and then rebuffers the transformed data to the Splunk Cluster. Kinesis Data Firehose provides the Lambda blueprints that you can use to create a Lambda function for data transformation.

Systems fail all the time. Let’s see how this integration handles outside failures to guarantee data durability. In cases when Kinesis Data Firehose can’t deliver data to the Splunk Cluster, data is automatically backed up to an S3 bucket. You can configure this feature while creating the Firehose delivery stream (3). You can choose to back up all data or only the data that’s failed during delivery to Splunk.

In addition to using S3 for data backup, this Firehose integration with Splunk supports Splunk Indexer Acknowledgments to guarantee event delivery. This feature is configured on Splunk’s HTTP Event Collector (HEC) (4). It ensures that HEC returns an acknowledgment to Kinesis Data Firehose only after data has been indexed and is available in the Splunk cluster (5).

Now let’s look at a hands-on exercise that shows how to forward VPC flow logs to Splunk.

Data coming from CloudWatch Logs is compressed with gzip compression. To work with this compression, we need to configure a Lambda-based data transformation in Kinesis Data Firehose to decompress the data and deposit it back into the stream. Firehose then delivers the raw logs to the Splunk Http Event Collector (HEC).

If delivery to the Splunk HEC fails, Firehose deposits the logs into an Amazon S3 bucket. You can then ingest the events from S3 using an alternate mechanism such as a Lambda function.

When data reaches Splunk (Enterprise or Cloud), Splunk parsing configurations (packaged in the Splunk Add-on for Kinesis Data Firehose) extract and parse all fields. They make data ready for querying and visualization using Splunk Enterprise and Splunk Cloud.

Walkthrough

Install the Splunk Add-on for Amazon Kinesis Data Firehose

The Splunk Add-on for Amazon Kinesis Data Firehose enables Splunk (be it Splunk Enterprise, Splunk App for AWS, or Splunk Enterprise Security) to use data ingested from Amazon Kinesis Data Firehose. Install the Add-on on all the indexers with an HTTP Event Collector (HEC). The Add-on is available for download from Splunkbase.

HTTP Event Collector (HEC)

Before you can use Kinesis Data Firehose to deliver data to Splunk, set up the Splunk HEC to receive the data. From Splunk web, go to the Setting menu, choose Data Inputs, and choose HTTP Event Collector. Choose Global Settings, ensure All tokens is enabled, and then choose Save. Then choose New Token to create a new HEC endpoint and token. When you create a new token, make sure that Enable indexer acknowledgment is checked.

When prompted to select a source type, select aws:cloudwatch:vpcflow.

Create an S3 backsplash bucket

To provide for situations in which Kinesis Data Firehose can’t deliver data to the Splunk Cluster, we use an S3 bucket to back up the data. You can configure this feature to back up all data or only the data that’s failed during delivery to Splunk.

Note: Bucket names are unique. Thus, you can’t use tmak-backsplash-bucket.

Create a Firehose Stream

On the AWS console, open the Amazon Kinesis service, go to the Firehose console, and choose Create Delivery Stream.

In the next section, you can specify whether you want to use an inline Lambda function for transformation. Because incoming CloudWatch Logs are gzip compressed, choose Enabled for Record transformation, and then choose Create new.

From the list of the available blueprint functions, choose Kinesis Data Firehose CloudWatch Logs Processor. This function unzips data and place it back into the Firehose stream in compliance with the record transformation output model.

Enter a name for the Lambda function, choose Choose an existing role, and then choose the role you created earlier. Then choose Create Function.

Go back to the Firehose Stream wizard, choose the Lambda function you just created, and then choose Next.

Note: Amazon Kinesis Data Firehose requires the Splunk HTTP Event Collector (HEC) endpoint to be terminated with a valid CA-signed certificate matching the DNS hostname used to connect to your HEC endpoint. You receive delivery errors if you are using a self-signed certificate.

In this example, we only back up logs that fail during delivery.

To monitor your Firehose delivery stream, enable error logging. Doing this means that you can monitor record delivery errors.

Create an IAM role for the Firehose stream by choosing Create new, or Choose. Doing this brings you to a new screen. Choose Create a new IAM role, give the role a name, and then choose Allow.

If you look at the policy document, you can see that the role gives Kinesis Data Firehose permission to publish error logs to CloudWatch, execute your Lambda function, and put records into your S3 backup bucket.

You now get a chance to review and adjust the Firehose stream settings. When you are satisfied, choose Create Stream. You get a confirmation once the stream is created and active.

Create a VPC Flow Log

To send events from Amazon VPC, you need to set up a VPC flow log. If you already have a VPC flow log you want to use, you can skip to the “Publish CloudWatch to Kinesis Data Firehose” section.

On the AWS console, open the Amazon VPC service. Then choose VPC, Your VPC, and choose the VPC you want to send flow logs from. Choose Flow Logs, and then choose Create Flow Log. If you don’t have an IAM role that allows your VPC to publish logs to CloudWatch, choose Set Up Permissions and Create new role. Use the defaults when presented with the screen to create the new IAM role.

Once active, your VPC flow log should look like the following.

Publish CloudWatch to Kinesis Data Firehose

When you generate traffic to or from your VPC, the log group is created in Amazon CloudWatch. The new log group has no subscription filter, so set up a subscription filter. Setting this up establishes a real-time data feed from the log group to your Firehose delivery stream.

At present, you have to use the AWS Command Line Interface (AWS CLI) to create a CloudWatch Logs subscription to a Kinesis Data Firehose stream. However, you can use the AWS console to create subscriptions to Lambda and Amazon Elasticsearch Service.

To allow CloudWatch to publish to your Firehose stream, you need to give it permissions.

When you run the AWS CLI command preceding, you don’t get any acknowledgment. To validate that your CloudWatch Log Group is subscribed to your Firehose stream, check the CloudWatch console.

As soon as the subscription filter is created, the real-time log data from the log group goes into your Firehose delivery stream. Your stream then delivers it to your Splunk Enterprise or Splunk Cloud environment for querying and visualization. The screenshot following is from Splunk Enterprise.

In addition, you can monitor and view metrics associated with your delivery stream using the AWS console.

Conclusion

Although our walkthrough uses VPC Flow Logs, the pattern can be used in many other scenarios. These include ingesting data from AWS IoT, other CloudWatch logs and events, Kinesis Streams or other data sources using the Kinesis Agent or Kinesis Producer Library. We also used Lambda blueprint Kinesis Data Firehose CloudWatch Logs Processor to transform streaming records from Kinesis Data Firehose. However, you might need to use a different Lambda blueprint or disable record transformation entirely depending on your use case. For an additional use case using Kinesis Data Firehose, check out This is My Architecture Video, which discusses how to securely centralize cross-account data analytics using Kinesis and Splunk.

Additional Reading

About the Authors

Tarik Makota is a solutions architect with the Amazon Web Services Partner Network. He provides technical guidance, design advice and thought leadership to AWS’ most strategic software partners. His career includes work in an extremely broad software development and architecture roles across ERP, financial printing, benefit delivery and administration and financial services. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.

Roy Arsan is a solutions architect in the Splunk Partner Integrations team. He has a background in product development, cloud architecture, and building consumer and enterprise cloud applications. More recently, he has architected Splunk solutions on major cloud providers, including an AWS Quick Start for Splunk that enables AWS users to easily deploy distributed Splunk Enterprise straight from their AWS console. He’s also the co-author of the AWS Lambda blueprints for Splunk. He holds an M.S. in Computer Science Engineering from the University of Michigan.

Whether you want to review a Security, Compliance, & Identity track session you attended at AWS re:Invent 2017, or you want to experience a session for the first time, videos and slide decks from the Security, Compliance, & Identity track are now available.

Introductory

SID201: IAM for Enterprises: How Vanguard Strikes the Balance Between Agility, Governance, and Security

During Q4, Backblaze deployed 100 petabytes worth of Seagate hard drives to our data centers. The newly deployed Seagate 10 and 12 TB drives are doing well and will help us meet our near term storage needs, but we know we’re going to need more drives — with higher capacities. That’s why the success of new hard drive technologies like Heat-Assisted Magnetic Recording (HAMR) from Seagate are very relevant to us here at Backblaze and to the storage industry in general. In today’s guest post we are pleased to have Mark Re, CTO at Seagate, give us an insider’s look behind the hard drive curtain to tell us how Seagate engineers are developing the HAMR technology and making it market ready starting in late 2018.

What is HAMR and How Does It Enable the High-Capacity Needs of the Future?

Earlier this year Seagate announced plans to make the first hard drives using Heat-Assisted Magnetic Recording, or HAMR, available by the end of 2018 in pilot volumes. Even as today’s market has embraced 10TB+ drives, the need for 20TB+ drives remains imperative in the relative near term. HAMR is the Seagate research team’s next major advance in hard drive technology.

HAMR is a technology that over time will enable a big increase in the amount of data that can be stored on a disk. A small laser is attached to a recording head, designed to heat a tiny spot on the disk where the data will be written. This allows a smaller bit cell to be written as either a 0 or a 1. The smaller bit cell size enables more bits to be crammed into a given surface area — increasing the areal density of data, and increasing drive capacity.

It sounds almost simple, but the science and engineering expertise required, the research, experimentation, lab development and product development to perfect this technology has been enormous. Below is an overview of the HAMR technology and you can dig into the details in our technical brief that provides a point-by-point rundown describing several key advances enabling the HAMR design.

As much time and resources as have been committed to developing HAMR, the need for its increased data density is indisputable. Demand for data storage keeps increasing. Businesses’ ability to manage and leverage more capacity is a competitive necessity, and IT spending on capacity continues to increase.

History of Increasing Storage Capacity

For the last 50 years areal density in the hard disk drive has been growing faster than Moore’s law, which is a very good thing. After all, customers from data centers and cloud service providers to creative professionals and game enthusiasts rarely go shopping looking for a hard drive just like the one they bought two years ago. The demands of increasing data on storage capacities inevitably increase, thus the technology constantly evolves.

According to the Advanced Storage Technology Consortium, HAMR will be the next significant storage technology innovation to increase the amount of storage in the area available to store data, also called the disk’s “areal density.” We believe this boost in areal density will help fuel hard drive product development and growth through the next decade.

Why do we Need to Develop Higher-Capacity Hard Drives? Can’t Current Technologies do the Job?

Why is HAMR’s increased data density so important?

Data has become critical to all aspects of human life, changing how we’re educated and entertained. It affects and informs the ways we experience each other and interact with businesses and the wider world. IDC research shows the datasphere — all the data generated by the world’s businesses and billions of consumer endpoints — will continue to double in size every two years. IDC forecasts that by 2025 the global datasphere will grow to 163 zettabytes (that is a trillion gigabytes). That’s ten times the 16.1 ZB of data generated in 2016. IDC cites five key trends intensifying the role of data in changing our world: embedded systems and the Internet of Things (IoT), instantly available mobile and real-time data, cognitive artificial intelligence (AI) systems, increased security data requirements, and critically, the evolution of data from playing a business background to playing a life-critical role.

Consumers use the cloud to manage everything from family photos and videos to data about their health and exercise routines. Real-time data created by connected devices — everything from Fitbit, Alexa and smart phones to home security systems, solar systems and autonomous cars — are fueling the emerging Data Age. On top of the obvious business and consumer data growth, our critical infrastructure like power grids, water systems, hospitals, road infrastructure and public transportation all demand and add to the growth of real-time data. Data is now a vital element in the smooth operation of all aspects of daily life.

All of this entails a significant infrastructure cost behind the scenes with the insatiable, global appetite for data storage. While a variety of storage technologies will continue to advance in data density (Seagate announced the first 60TB 3.5-inch SSD unit for example), high-capacity hard drives serve as the primary foundational core of our interconnected, cloud and IoT-based dependence on data.

HAMR Hard Drive Technology

Seagate has been working on heat assisted magnetic recording (HAMR) in one form or another since the late 1990s. During this time we’ve made many breakthroughs in making reliable near field transducers, special high capacity HAMR media, and figuring out a way to put a laser on each and every head that is no larger than a grain of salt.

The development of HAMR has required Seagate to consider and overcome a myriad of scientific and technical challenges including new kinds of magnetic media, nano-plasmonic device design and fabrication, laser integration, high-temperature head-disk interactions, and thermal regulation.

A typical hard drive inside any computer or server contains one or more rigid disks coated with a magnetically sensitive film consisting of tiny magnetic grains. Data is recorded when a magnetic write-head flies just above the spinning disk; the write head rapidly flips the magnetization of one magnetic region of grains so that its magnetic pole points up or down, to encode a 1 or a 0 in binary code.

Increasing the amount of data you can store on a disk requires cramming magnetic regions closer together, which means the grains need to be smaller so they won’t interfere with each other.

Heat Assisted Magnetic Recording (HAMR) is the next step to enable us to increase the density of grains — or bit density. Current projections are that HAMR can achieve 5 Tbpsi (Terabits per square inch) on conventional HAMR media, and in the future will be able to achieve 10 Tbpsi or higher with bit patterned media (in which discrete dots are predefined on the media in regular, efficient, very dense patterns). These technologies will enable hard drives with capacities higher than 100 TB before 2030.

The major problem with packing bits so closely together is that if you do that on conventional magnetic media, the bits (and the data they represent) become thermally unstable, and may flip. So, to make the grains maintain their stability — their ability to store bits over a long period of time — we need to develop a recording media that has higher coercivity. That means it’s magnetically more stable during storage, but it is more difficult to change the magnetic characteristics of the media when writing (harder to flip a grain from a 0 to a 1 or vice versa).

That’s why HAMR’s first key hardware advance required developing a new recording media that keeps bits stable — using high anisotropy (or “hard”) magnetic materials such as iron-platinum alloy (FePt), which resist magnetic change at normal temperatures. Over years of HAMR development, Seagate researchers have tested and proven out a variety of FePt granular media films, with varying alloy composition and chemical ordering.

In fact the new media is so “hard” that conventional recording heads won’t be able to flip the bits, or write new data, under normal temperatures. If you add heat to the tiny spot on which you want to write data, you can make the media’s coercive field lower than the magnetic field provided by the recording head — in other words, enable the write head to flip that bit.

So, a challenge with HAMR has been to replace conventional perpendicular magnetic recording (PMR), in which the write head operates at room temperature, with a write technology that heats the thin film recording medium on the disk platter to temperatures above 400 °C. The basic principle is to heat a tiny region of several magnetic grains for a very short time (~1 nanoseconds) to a temperature high enough to make the media’s coercive field lower than the write head’s magnetic field. Immediately after the heat pulse, the region quickly cools down and the bit’s magnetic orientation is frozen in place.

Applying this dynamic nano-heating is where HAMR’s famous “laser” comes in. A plasmonic near-field transducer (NFT) has been integrated into the recording head, to heat the media and enable magnetic change at a specific point. Plasmonic NFTs are used to focus and confine light energy to regions smaller than the wavelength of light. This enables us to heat an extremely small region, measured in nanometers, on the disk media to reduce its magnetic coercivity,

Moving HAMR Forward

As always in advanced engineering, the devil — or many devils — is in the details. As noted earlier, our technical brief provides a point-by-point short illustrated summary of HAMR’s key changes.

Although hard work remains, we believe this technology is nearly ready for commercialization. Seagate has the best engineers in the world working towards a goal of a 20 Terabyte drive by 2019. We hope we’ve given you a glimpse into the amount of engineering that goes into a hard drive. Keeping up with the world’s insatiable appetite to create, capture, store, secure, manage, analyze, rapidly access and share data is a challenge we work on every day.

With thousands of HAMR drives already being made in our manufacturing facilities, our internal and external supply chain is solidly in place, and volume manufacturing tools are online. This year we began shipping initial units for customer tests, and production units will ship to key customers by the end of 2018. Prepare for breakthrough capacities.

Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.

For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.

Compiling Hail

Because Hail is still under active development, you must compile it before you can start using it. To help simplify the process, you can launch the following AWS CloudFormation template that creates an EMR cluster, compiles Hail, and installs a Jupyter Notebook so that you’re ready to go with Hail.

There are a few things to note about the AWS CloudFormation template. You must provide a password for the Jupyter Notebook. Also, you must provide a virtual private cloud (VPC) to launch Amazon EMR in, and make sure that you select a subnet from within that VPC. Next, update the cluster resources to fit your needs. Lastly, the HailBuildOutputS3Path parameter should be an Amazon S3 bucket/prefix, where you should save the compiled Hail binaries for later use. Leave the Hail and Spark versions as is, unless you’re comfortable experimenting with more recent versions.

When you’ve completed these steps, the following files are saved locally on the cluster to be used when running the Apache Spark Python API (PySpark) shell.

/home/hadoop/hail-python.zip/home/hadoop/hail-all-spark.jar

The files are also copied to the Amazon S3 location defined by the AWS CloudFormation template so that you can include them when running jobs using the Amazon EMR Step API.

Collecting genome data

To get started with Hail, use the 1000 Genome Project dataset available on AWS. The data that you will use is located at s3://1000genomes/release/20130502/.

For Hail to process these files in an efficient manner, they need to be block compressed. In many cases, files that use gzip compression are compressed in blocks, so you don’t need to recompress—you can just rename the file extension to “.bgz” from “.gz” . Hail can process .gz files, but it’s much slower and not recommended. The simple way to accomplish this is to copy the data files from the public S3 bucket to your own and rename them.

The following is the Bash command line to copy the first five genome Variant Call Format (VCF) files and rename them appropriately using the AWS CLI.

for i in $(seq 5); do aws s3 cp s3://1000genomes/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz s3://your_bucket/prefix/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz; done

Now that you have some data files containing variants in the Variant Call Format, you need to get the sample annotations that go along with them. These annotations provide more information about each sample, such as the population they are a part of.

Exploring the genomic data

In this section, you use the data collected in the previous section to explore genome variations interactively using a Jupyter Notebook. You then create a simple ETL job to convert these variations into Parquet format. Finally, you query it using Amazon Athena.

Let’s open the Jupyter notebook. To start, sign in to the AWS Management Console, and open the AWS CloudFormation console. Choose the stack that you created, and then choose the Output tab. There you see the JupyterURL. Open this URL in your browser.

Go ahead and download the Jupyter Notebook that is provided to your local machine. Log in to Jupyter with the password that you provided during stack creation. Choose Upload on the right side, and then choose the notebook from your local machine.

After the notebook is uploaded, choose it from the list on the left to open it.

Select the first cell, update the S3 bucket location to point to the bucket where you saved the compiled Hail libraries, and then choose Run. This code imports the Hail modules that you compiled at the beginning. When the cell is executing, you will see In [*]. When the process is complete, the asterisk (*) is replaced by a number, for example, In [1].

Next, run the subsequent two cells, which imports the Hail module into PySpark and initiates the Hail context.

The next cell imports a single VCF file from the bucket where you saved your data in the previous section. If you change the Amazon S3 path to not include a file name, it imports all the VCF files in that directory. Depending on your cluster size, it might take a few minutes.

Remember that in the previous section, you also copied an annotation file. Now you use it to annotate the VCF files that you’ve loaded with Hail. Execute the next cell—as a shortcut, you can select the cell and press Shift+Enter.

The import_table API takes a path to the annotation file in TSV (tab-separated values) format and a parameter named impute that attempts to infer the schema of the file, as shown in the output below the cell.

At this point, you can interactively explore the data. For example, you can count the number of samples you have and group them by population.

You can also calculate the standard quality control (QC) metrics on your variants and samples.

What if you want to query this data outside of Hail and Spark, for example, using Amazon Athena? To start, you need to change the column names to lowercase because Athena currently supports only lowercase names. To do that, use the two functions provided in the notebook and call them on your virtual dedicated server (VDS), as shown in the following image. Note that you’re only changing the case of the variants and samples schemas. If you’ve further augmented your VDS, you might need to modify the lowercase functions to do the same for those schemas.

In the current version of Hail, the sample annotations are not stored in the exported Parquet VDS, so you need to save them separately. As noted by the Hail maintainers, in future versions, the data represented by the VDS Parquet output will change, and it is recommended that you also export the variant annotations. So let’s do that.

Note that both of these lines are similar in that they export a table representation of the sample and variant annotations, convert them to a Spark DataFrame, and finally save them to Amazon S3 in Parquet file format.

Finally, it is beneficial to save the VDS file back to Amazon S3 so that next time you need to analyze your data, you can load it without having to start from the raw VCF. Note that when Hail saves your data, it requires a path and a file name.

After you run these cells, expect it to take some time as it writes out the data.

Discovering table metadata

Before you can query your data, you need to tell Athena the schema of your data. You have a couple of options. The first is to use AWS Glue to crawl the S3 bucket, infer the schema from the data, and create the appropriate table. Before proceeding, you might need to migrate your Athena database to use the AWS Glue Data Catalog.

Creating tables in AWS Glue

To use the AWS Glue crawler, open the AWS Glue console and choose Crawlers in the left navigation pane.

Then choose Add crawler to create a new crawler.

Next, give your crawler a name and assign the appropriate IAM role. Leave Amazon S3 as the data source, and select the S3 bucket where you saved the data and the sample annotations. When you set the crawler’s Include path, be sure to include the entire path, for example: s3://output_bucket/hail_data/sample_annotations/

Under the Exclusion Paths, type _SUCCESS, so that you don’t crawl that particular file.

Continue forward with the default settings until you are asked if you want to add another source. Choose Yes, and add the Amazon S3 path to the variant annotation bucket s3://your_output_bucket/hail_data/sample_annotations/ so that it can build your variant annotation table. Give it an existing database name, or create a new one.

Provide a table prefix and choose Next. Then choose Finish. At this point, assuming that the data is finished writing, you can go ahead and run the crawler. When it finishes, you have two new tables in the database you created that should look something like the following:

You can explore the schema of these tables by choosing their name and then choosing Edit Schema on the right side of the table view; for example:

Creating tables in Amazon Athena

If you cannot or do not want to use AWS Glue crawlers, you can add the tables via the Athena console by typing the following statements:

Querying annotations with Athena

In the Amazon Athena console, choose the database in which your tables were created. In this case, it looks something like the following:

To verify that you have data, choose the three dots on the right, and then choose Preview table.

Indeed, you can see some data.

You can further explore the sample and variant annotations along with the calculated QC metrics that you calculated previously using Hail.

Summary

To summarize, this post demonstrated the ease in which you can install, configure, and use Hail, an open source highly scalable framework for exploring and analyzing genomics data on Amazon EMR. We demonstrated setting up a Jupyter Notebook to make our exploration easy. We also used the power of Hail to calculate quality control metrics for variants and samples. We exported them to Amazon S3 and allowed a broader range of users and analysts to explore them on-demand in a serverless environment using Amazon Athena.

About the Author

Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is big Manchester United fan cheering his team on and hanging out with his family.

“I want my company to innovate, but I am not convinced we can execute successfully.” Far too many times I have heard this fear expressed by senior executives that I have met at different points in my career. In fact, a recent study published by Price Waterhouse Coopers found that while 93% of executives depend on innovation to drive growth, more than half are challenged to take innovative ideas to market quickly in a scalable way.

Many customers are struggling with how to drive enterprise innovation, so I was thrilled to share the stage at AWS re:Invent this past week with several senior executives who have successfully broken this mold to drive amazing enterprise innovation. In particular, I want to thank Parag Karnik from Johnson & Johnson, Bill Rothe from Hess Corporation, Dave Williams from Just Eat, and Olga Lagunova from Pitney Bowes for sharing their stories of innovation, creativity, and solid execution.

Among the many new announcements from AWS this past week, I am particularly excited about the following newly-launched AWS products and programs that I announced at re:Invent to drive new innovations by our enterprise customers:

AI: New Deep Learning Amazon Machine Image (AMI) on EC2 Windows As I shared at re:Invent, customers such as Infor are already successfully leveraging artificial intelligence tools on AWS to deliver tailored, industry-specific applications to their customers. We want to facilitate more of our Windows developers to get started quickly and easily with AI, leveraging machine learning based tools with popular deep learning frameworks, such as Apache MXNet, TensorFlow, and Caffe2. In order to enable this, I announced at re:Invent that AWS now offers a new Deep Learning AMI for Microsoft Windows. The AMI is tailored to facilitate large scale training of deep-learning models, and enables quick and easy setup of Windows Server-based compute resources for machine learning applications.

IoT: Visualize and Analyze SQL and IoT Data Forecasts show as many as 31 billion IoT devices by 2020. AWS wants every Windows customer to take advantage of the data available from their devices. Pitney Bowes, for example, now has more than 130,000 IoT devices streaming data to AWS. Using machine learning, Pitney Bowes enriches and analyzes data to enhance their customer experience, improve efficiencies, and create new data products. AWS IoT Analytics can now be leveraged to run analytics on IoT data and get insights that help you make better and more accurate decisions for IoT applications and machine learning use cases. AWS IoT Analytics can automatically enrich IoT device data with contextual metadata such as your SQL Server transactional data.

New Capabilities for .NET Developers on AWS In addition to all of the enhancements we’ve introduced to deliver a first class experience to Windows developers on AWS, we announced that we are including .NET Core 2.0 support in AWS Lambda and AWS CodeBuild, which will be available for broader use early next year. .NET Core 2.0 packs a number of new features such as Razor pages, better compatibility with .NET framework, more than double the number of APIs compared to the previous versions, and much more. With this announcement, you will be able to take advantage of all latest .NET Core features on Lambda and CodeBuild for building modern serverless and DevOps centric solutions.

License optimization for BYOL AWS provides you a wide variety of instance types and families that best meet your workload needs. If you are using software licensed by the number of vCPUs, you want the ability to further tweak vCPU count to optimize license spend. I announced the upcoming ability to optimize CPUs for EC2, giving you greater control over your EC2 instances on two fronts:

You can specify a custom number of vCPUs when launching new instances to save on vCPU based licensing costs. For example, SQL Server licensing spend.

You can disable Hyper-Threading Technology for workloads that perform well with single-threaded CPUs, like some high-performance computing (HPC) applications.

Using these capabilities, customers who bring their own license (BYOL) will be able to optimize their license usage and save on the license costs.

Server Migration Service for Hyper-V Virtual Machines As Bill Rothe from Hess Corporation shared at re:Invent, Hess has successfully migrated a wide range of workloads to the cloud, including SQL Server, SharePoint, SAP HANA, and many others. AWS Server Migration Service (SMS) now supports Hyper-V virtual machine (VM) migration, in order to further support enterprise migrations like these. AWS Server Migration Service will enable you to more easily co-ordinate large-scale server migrations from on-premise Hyper-V environments to AWS. AWS Server Migration Service allows you to automate, schedule, and track incremental replications of live server volumes. The replicated volumes are encrypted in transit and saved as a new Amazon Machine Image (AMI), which can be launched as an EC2 instance on AWS.

Microsoft Premier Support for AWS End-Customers I was pleased to announce that Microsoft and AWS have developed new areas of support integration to help ensure a great customer experience. Microsoft Premier Support is on board to help AWS assist end customers. AWS Support engineers can escalate directly to Microsoft Support on behalf of AWS customers running Microsoft workloads.

Best Practice Tools: HIPAA Compliance and Digital Innovation Workshop In November, we updated our HIPAA-focused white paper, outlining how you can use AWS to create HIPAA-compliant applications. In the first quarter of next year, we will publish a HIPAA Implementation Guide that expands on our HIPAA Quick Start to enable you to follow strict security, compliance, and risk management controls for common healthcare use cases. I was also pleased to award a Digital Innovation Workshop to one of our customers in my re:Invent session, and look forward to seeing more customers take advantage of this workshop.

AWS: The Continuous Innovation Cloud A common thread we see across customers is that continuous innovation from AWS enables their ongoing reinvention. Continuous innovation means that you are always getting a newer, better offering every single day. Sometimes it is in the form of brand new services and capabilities, and sometimes it is happening invisibly, under the covers where your environment just keeps getting better. I invite you to learn more about how you can accelerate your innovation journey with recently launched AWS services and AWS best practices. If you are migrating Windows workloads, speak with your AWS sales representative or an AWS Microsoft Workloads Competency Partner to learn how you can leverage our re:Think for Windows program for credits to start your migration.

Tags

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.