Wherever Google has a presence, rightsholders are around to accuse the search giant of not doing enough to deal with piracy.

Over the past several years, the company has been attacked by both the music and movie industries but despite overtures from Google, criticism still floods in.

In Australia, things are definitely heating up. Village Roadshow, one of the nation’s foremost movie companies, has been an extremely vocal Google critic since 2015 but now its co-chief, the outspoken Graham Burke, seems to want to take things to the next level.

As part of yet another broadside against Google, Burke has for the second time in a month accused Google of playing a large part in online digital crime.

“My view is they are complicit and they are facilitating crime,” Burke said, adding that if Google wants to sue him over his comments, they’re very welcome to do so.

It’s highly unlikely that Google will take the bait. Burke’s attempt at pushing the issue further into the spotlight will have been spotted a mile off but in any event, legal battles with Google aren’t really something that Burke wants to get involved in.

Australia is currently in the midst of a consultation process for the Copyright Amendment (Service Providers) Bill 2017 which would extend the country’s safe harbor provisions to a broader range of service providers including educational institutions, libraries, archives, key cultural institutions and organizations assisting people with disabilities.

For its part, Village Roadshow is extremely concerned that these provisions may be extended to other providers – specifically Google – who might then use expanded safe harbor to deflect more liability in respect of piracy.

“Village Roadshow….urges that there be no further amendments to safe harbor and in particular there is no advantage to Australia in extending safe harbor to Google,” Burke wrote in his company’s recent submission to the government.

“It is very unlikely given their size and power that as content owners we would ever sue them but if we don’t have that right then we stand naked. Most importantly if Google do the right thing by Australia on the question of piracy then there will be no issues. However, they are very far from this position and demonstrably are facilitating crime.”

Accusations of crime facilitation are nothing new for Google, with rightsholders in the US and Europe having accused the company of the same a number of times over the years. In response, Google always insists that it abides by relevant laws and actually goes much further in tackling piracy than legislation currently requires.

On the safe harbor front, Google begins by saying that not expanding provisions to service providers will have a seriously detrimental effect on business development in the region.

“[Excluding] online service providers falls far short of a balanced, pro-innovation environment for Australia. Further, it takes Australia out of step with other digital economies by creating regulatory uncertainty for [venture capital] investment and startup/entrepreneurial success,” Google’s submission reads.

“Under the new scheme, Australian-based startups and service providers, unlike their international counterparts, will not receive clear and consistent legal protection when they respond to complaints from rightsholders about alleged instances of online infringement by third-party users on their services,” Google notes.

Interestingly, Google then delivers what appears to be a loosely veiled threat.

One of the key anti-piracy strategies touted by the mainstream entertainment companies is collaboration between rightsholders and service providers, including the latter providing voluntary tools to police infringement online. Google says that if service providers are given a raw deal on safe harbor, the extent of future cooperation may be at risk.

“If Australian-based service providers are carved out of the new safe harbor regime post-reform, they will operate from a lower incentive to build and test new voluntary tools to combat online piracy, potentially reducing their contributions to innovation in best practices in both Australia and international markets,” the company warns.

But while Village Roadshow argues against safe harbors and warns that piracy could kill the movie industry, the company is quietly optimistic that the tide is turning.

In a presentation to investors last week, the company said that reducing piracy would have “only an upside” for its business but also added that new research indicates that “piracy growth [is] getting arrested.” As a result, the company says that it will build on the notion that “74% of people see piracy as ‘wrong/theft’” and will call on Australians to do the right thing.

In the meantime, the pressure on Google will continue but lawsuits – in either direction – won’t provide an answer.

We’re just over three weeks away from the Raspberry Jam Big Birthday Weekend 2018, our community celebration of Raspberry Pi’s sixth birthday. Instead of an event in Cambridge, as we’ve held in the past, we’re coordinating Raspberry Jam events to take place around the world on 3–4 March, so that as many people as possible can join in. Well over 100 Jams have been confirmed so far.

Take a look at the events map and the full list (including those who haven’t added their event to the map quite yet).

We will have Raspberry Jams in 35 countries across six continents

Birthday kits

We had some special swag made especially for the birthday, including these T-shirts, which we’ve sent to Jam organisers:

There is also a poster with a list of participating Jams, which you can download:

Raspberry Jam photo booth

I created a Raspberry Jam photo booth that overlays photos with the Big Birthday Weekend logo and then tweets the picture from your Jam’s account — you’ll be seeing plenty of those if you follow the #PiParty hashtag on 3–4 March.

Check out the project on GitHub, and feel free to set up your own booth, or modify it to your own requirements. We’ve included text annotations in several languages, and more contributions are very welcome.
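If you're curious how such a booth hangs together, here is a minimal sketch of the idea (not the actual project code): capture a photo with the camera module, overlay a logo with Pillow, and tweet the result with tweepy. The file names, credentials, and status text are placeholders.

```python
# Minimal photo-booth sketch (not the official project code).
# Assumes a Raspberry Pi camera module, a local logo.png, and Twitter API credentials.
import time

from picamera import PiCamera
from PIL import Image
import tweepy

CONSUMER_KEY, CONSUMER_SECRET = "xxx", "xxx"   # placeholder credentials
ACCESS_TOKEN, ACCESS_SECRET = "xxx", "xxx"

def take_photo(path="photo.jpg"):
    with PiCamera() as camera:
        camera.resolution = (1024, 768)
        camera.start_preview()
        time.sleep(2)                  # give people a moment to pose
        camera.capture(path)
    return path

def overlay_logo(photo_path, logo_path="logo.png"):
    photo = Image.open(photo_path).convert("RGBA")
    logo = Image.open(logo_path).convert("RGBA")
    # paste the logo in the bottom-right corner, using its alpha channel as the mask
    photo.paste(logo, (photo.width - logo.width, photo.height - logo.height), logo)
    out_path = "tweet.png"
    photo.save(out_path)
    return out_path

def tweet_photo(image_path, status="Happy birthday Raspberry Pi! #PiParty"):
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    tweepy.API(auth).update_with_media(image_path, status=status)

if __name__ == "__main__":
    tweet_photo(overlay_logo(take_photo()))
```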

There’s still time…

If you can’t find a Jam near you, there’s still time to organise one for the Big Birthday Weekend. All you need to do is find a venue — a room in a school or library will do — and think about what you’d like to do at the event. Some Jams have Raspberry Pis set up for workshops and practical activities, some arrange tech talks, some put on show-and-tell — it’s up to you. To help you along, there’s the Raspberry Jam Guidebook full of advice and tips from Jam organisers.

They packed. And they packed. And they packed some more. Who’s expecting one of these #rjam kits for the Raspberry Jam Big Birthday Weekend?

Download the Raspberry Jam branding pack, and the special birthday branding pack, where you’ll find logos, graphical assets, flyer templates, worksheets, and more. When you’re ready to announce your event, create a webpage for it — you can use a site like Eventbrite or Meetup — and submit your Jam to us so it will appear on the Jam map!

We are six

We’re really looking forward to celebrating our birthday with thousands of people around the world. Over 48 hours, people of all ages will come together at more than 100 events to learn, share ideas, meet people, and make things during our Big Birthday Weekend.

Since we released the first Raspberry Pi in 2012, we’ve sold 17 million of them. We’re also reaching almost 200,000 children in 130 countries around the world through Code Club and CoderDojo, we’ve trained over 1500 Raspberry Pi Certified Educators, and we’ve sent code written by more than 6,800 children into space. Our magazines are read by a quarter of a million people, and millions more use our free online learning resources. There’s plenty to celebrate and even more still to do: we really hope you’ll join us from a Jam near you on 3–4 March.

It was an accident

The most important fact about Wannacry is that it was an accident. We’ve had 30 years of experience with Internet worms teaching us that worms are always accidents. While launching worms may be intentional, their effects cannot be predicted. While they appear to have targets, like Slammer against South Korea, or Witty against the Pentagon, further analysis shows this was just a random effect that was impossible to predict ahead of time. Only in hindsight are these effects explainable.

We should hold those causing accidents accountable, too, but it’s a different accountability. The U.S. has caused more civilian deaths in its War on Terror than the terrorists caused triggering that war. But we hold these to be morally different: the terrorists targeted the innocent, whereas the U.S. takes great pains to avoid civilian casualties.

Since we are talking about blaming those responsible for accidents, we also must include the NSA in that mix. The NSA created, then allowed the release of, weaponized exploits. That’s like accidentally dropping a load of unexploded bombs near a village. When those bombs are then used, those having lost the weapons are held guilty along with those using them. Yes, while we should blame the hacker who added ETERNALBLUE to their ransomware, we should also blame the NSA for losing control of ETERNALBLUE.

A country and its assets are different

Was it North Korea, or hackers affiliated with North Korea? These aren’t the same.

It’s hard for North Korea to have hackers of its own. It doesn’t have citizens who grow up with computers to pick from. Moreover, an internal hacking corps would create tainted citizens exposed to dangerous outside ideas. Update: Some people have pointed out that Kim Il-sung University in the capital does have some contact with the outside world, with academics granted limited Internet access, so I guess some tainting is allowed. Still, what we know of North Korea’s hacking efforts largely comes from hackers they employ outside North Korea. It was the Lazarus Group, outside North Korea, that did Wannacry.

Instead, North Korea develops external hacking “assets”, supporting several external hacking groups in China, Japan, and South Korea. This is similar to how intelligence agencies develop human “assets” in foreign countries. While these assets do things for their handlers, they also have normal day jobs, and do many things that are wholly independent and even sometimes against their handler’s interests.

For example, this Muckrock FOIA dump shows how “CIA assets” independently worked for Castro and assassinated a Panamanian president. That they also worked for the CIA does not make the CIA responsible for the Panamanian assassination.

That CIA/intelligence assets work this way is well-known and uncontroversial. The fact that countries use hacker assets like this is the controversial part. These hackers do act independently, yet we refuse to consider this when we want to “attribute” attacks.

Attribution is political

We have far better attribution for the nPetya attacks. It was less accidental (they clearly desired to disrupt Ukraine), and the hackers were much closer to the Russian government (Russian citizens). Yet, the Trump administration isn’t fighting Russia, they are fighting North Korea, so they don’t officially attribute nPetya to Russia, but do attribute Wannacry to North Korea.

Trump is in conflict with North Korea. He is looking for ways to escalate the conflict. Attributing Wannacry helps achieve his political objectives.

That this was blatantly political is demonstrated by the way it was released to the press. It wasn’t released in the normal way, where the administration can stand behind it and get challenged on the particulars. Instead, it was pre-released through the usual channel of “anonymous government officials” to the NYTimes, and then backed up with an op-ed in the Wall Street Journal. The government leaks information like this when it’s weak, not when it’s strong.

The proper way is to release the evidence upon which the decision was made, so that the public can challenge it. Among the questions the public would ask is whether it was North Korea’s intention to cause precisely this effect, such as disabling the British NHS. Or whether it was merely hackers “affiliated” with North Korea, rather than hackers carrying out North Korea’s orders. We cannot challenge the government this way because the government intentionally holds itself above such accountability.

Conclusion

We believe hacking groups tied to North Korea are responsible for Wannacry. Yet, even if that’s true, we still have three attribution problems. We still don’t know if that was intentional, in pursuit of some political goal, or an accident. We still don’t know if it was at the direction of North Korea, or whether their hacker assets acted independently. We still don’t know if the government has answers to these questions, or whether it’s exploiting this doubt to achieve political support for actions against North Korea.

This is a guest post by Yukinori Koide, the head of development for the Newspass department at Gunosy.

Gunosy is a news curation application that covers a wide range of topics, such as entertainment, sports, politics, and gourmet news. The application has been installed more than 20 million times.

Gunosy aims to provide people with the content they want without the stress of dealing with a large influx of information. We analyze user attributes, such as gender and age, and past activity logs like click-through rate (CTR). We combine this information with article attributes to provide trending, personalized news articles to users.

Why does Gunosy need real-time processing?

Users need fresh and personalized news. There are two constraints to consider when delivering appropriate articles:

Time: Articles have freshness—that is, they lose value over time. New articles need to reach users as soon as possible.

Frequency (volume): Only a limited number of articles can be shown. It’s unreasonable to display all articles in the application, and users can’t read all of them anyway.

To deliver fresh articles with a high probability that the user is interested in them, it’s necessary to include not only past user activity logs and some feature values of articles, but also the most recent (real-time) user activity logs.

We optimize the delivery of articles in two steps:

Personalization: Deliver articles based on each user’s attributes, past activity logs, and the feature values of each article, to account for each user’s interests.

Real-time optimization: Use the most recent (real-time) user activity logs to refine which articles are delivered, as quickly as possible.

Optimizing the delivery of articles always begins as a cold start. Initially, we deliver articles based on past logs; we then use real-time data to optimize as quickly as possible. In addition, news stays fresh for only a short time: day-old news is already old news, and even news that is three hours old is stale. Therefore, shortening the time between step 1 and step 2 is important.
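To make the freshness constraint concrete, here is a toy illustration only; it is not our actual ranking logic, and the half-life value below is an arbitrary example. The idea is that a predicted interest score decays exponentially with an article's age, so even a well-matched article drops out of the candidate set within a few hours.

```python
# Toy illustration: combine a predicted interest score with an exponential
# freshness decay so that articles lose value over time. Not the production algorithm.
from datetime import datetime, timezone

HALF_LIFE_HOURS = 3.0  # arbitrary example: a 3-hour-old article is worth half a fresh one

def freshness(published_at, now=None):
    now = now or datetime.now(timezone.utc)
    age_hours = (now - published_at).total_seconds() / 3600.0
    return 0.5 ** (age_hours / HALF_LIFE_HOURS)

def delivery_score(predicted_ctr, published_at):
    # predicted_ctr would come from user attributes and past activity logs
    return predicted_ctr * freshness(published_at)
```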

To tackle this issue, we chose AWS for processing streaming data because of its fully managed services and cost-effectiveness.

Solution

The following diagrams depict the architecture for optimizing article delivery by processing real-time user activity logs.

There are three processing flows:

Process real-time user activity logs.

Store and process all user-based and article-based logs.

Execute ad hoc or heavy queries.

In this post, I focus on the first processing flow and explain how it works.

Process real-time user activity logs

The following are the steps for processing user activity logs in real time using Kinesis Data Streams and Kinesis Data Analytics.

The Fluentd server sends user activity logs to Kinesis Data Streams.
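In production, Fluentd handles this step, but the equivalent API call is easy to sketch. The snippet below is illustrative only; the stream name and log fields are hypothetical, since the exact log schema isn't reproduced here.

```python
# Illustrative producer: put one user activity log into Kinesis Data Streams.
# The stream name and log fields below are hypothetical.
import json

import boto3

kinesis = boto3.client("kinesis", region_name="ap-northeast-1")  # assumed Region

activity_log = {                       # hypothetical schema
    "user_id": "u-12345",
    "article_id": "a-67890",
    "event": "click",
    "timestamp": "2018-01-15T12:34:56Z",
}

kinesis.put_record(
    StreamName="user-activity-stream",          # hypothetical stream name
    Data=json.dumps(activity_log).encode("utf-8"),
    PartitionKey=activity_log["user_id"],       # keep one user's events on one shard
)
```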

Batch servers get the results from Amazon ES every minute. They then optimize article delivery, combining those results with other data sources using a proprietary optimization algorithm.

How to connect a stream to another stream in another AWS Region

When we built the solution, Kinesis Data Analytics was not available in the Asia Pacific (Tokyo) Region, so we used the US West (Oregon) Region. The following shows how we connected a data stream to another data stream in the other Region.

There is no need to keep every component in a single AWS Region, unless millisecond-level differences in response time are critical to the service.
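As a rough sketch of one way to do this (not necessarily our exact implementation), a small forwarder can read records from the Tokyo stream and re-put them into the Oregon stream, where Kinesis Data Analytics can consume them. The stream names below are hypothetical, and for brevity the sketch reads only a single shard.

```python
# Minimal cross-Region forwarder sketch: read from a stream in ap-northeast-1
# and re-put the records into a stream in us-west-2. Hypothetical stream names;
# only the first shard is read, for brevity.
import time

import boto3

src = boto3.client("kinesis", region_name="ap-northeast-1")
dst = boto3.client("kinesis", region_name="us-west-2")

SRC_STREAM, DST_STREAM = "user-activity-tokyo", "user-activity-oregon"

shard_id = src.describe_stream(StreamName=SRC_STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = src.get_shard_iterator(
    StreamName=SRC_STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    out = src.get_records(ShardIterator=iterator, Limit=500)
    if out["Records"]:
        dst.put_records(
            StreamName=DST_STREAM,
            Records=[
                {"Data": r["Data"], "PartitionKey": r["PartitionKey"]}
                for r in out["Records"]
            ],
        )
    iterator = out["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits
```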

Benefits

The solution provides benefits for both our company and for our users. Benefits for the company are cost savings—including development costs, operational costs, and infrastructure costs—and reducing delivery time. Users can now find articles of interest more quickly. The solution can process more than 500,000 records per minute, and it enables fast and personalized news curating for our users.

Conclusion

In this post, I showed you how we optimize trending user activities to personalize news using Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, and related AWS services in Gunosy.

AWS gives us a quick and economical solution and a good experience.

If you have questions or suggestions, please comment below.

About the Authors

Yukinori Koide is the head of development for the Newspass department at Gunosy. He is working on standardization of provisioning and deployment flow, promoting the utilization of serverless and containers for machine learning and AI services. His favorite AWS services are DynamoDB, Lambda, Kinesis, and ECS.

Akihiro Tsukada is a start-up solutions architect with AWS. He supports start-up companies in Japan technically at many levels, ranging from seed to later-stage.

Yuta Ishii is a solutions architect with AWS. He works with our customers to provide architectural guidance for building media & entertainment services, helping them improve the value of their services when using AWS.

Last week I attended a talk given by Bryan Mistele, president of Seattle-based INRIX. Bryan’s talk provided a glimpse into the future of transportation, centered on four principal attributes, often abbreviated as ACES:

Autonomous – Cars and trucks are gaining the ability to scan and to make sense of their environments and to navigate without human input.

Connected – Vehicles of all types have the ability to take advantage of bidirectional connections (either full-time or intermittent) to other cars and to cloud-based resources. They can upload road and performance data, communicate with each other to run in packs, and take advantage of traffic and weather data.

Electric – Continued development of battery and motor technology will make electric vehicles more convenient, cost-effective, and environmentally friendly.

Shared – Car sharing and ride sharing services will change usage and ownership models, shifting the emphasis from individually owned vehicles to transportation on demand.

Individually and in combination, these emerging attributes mean that the cars and trucks we will see and use in the decade to come will be markedly different than those of the past.

On the Road with AWS

AWS customers are already using our AWS IoT, edge computing, Amazon Machine Learning, and Alexa products to bring this future to life – vehicle manufacturers, their tier 1 suppliers, and AutoTech startups all use AWS for their ACES initiatives. AWS Greengrass is playing an important role here, attracting design wins and helping our customers to add processing power and machine learning inferencing at the edge.

AWS customer Aptiv (formerly Delphi) talked about their Automated Mobility on Demand (AMoD) smart vehicle architecture in an AWS re:Invent session. Aptiv’s AMoD platform will use Greengrass and microservices to drive the onboard user experience, along with edge processing, monitoring, and control. Here’s an overview:

Another customer, Denso of Japan (one of the world’s largest suppliers of auto components and software) is using Greengrass and AWS IoT to support their vision of Mobility as a Service (MaaS). Here’s a video:

AWS at CES

The AWS team will be out in force at CES in Las Vegas and would love to talk to you. They’ll be running demos that show how AWS can help to bring innovation and personalization to connected and autonomous vehicles.

Personalized In-Vehicle Experience – This demo shows how AWS AI and Machine Learning can be used to create a highly personalized and branded in-vehicle experience. It makes use of Amazon Lex, Polly, and Amazon Rekognition, but the design is flexible and can be used with other services as well. The demo encompasses driver registration, login and startup (including facial recognition), voice assistance for contextual guidance, personalized e-commerce, and vehicle control. Here’s the architecture for the voice assistance:
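As a rough illustration of the facial-recognition step (an assumption about how such a login check could be wired, not the demo's actual code), Amazon Rekognition can compare a camera frame against the registered driver's photo. The bucket and object names below are made up.

```python
# Hedged illustration of a facial-recognition login check with Amazon Rekognition.
# Bucket and object names are hypothetical; this is not the CES demo's actual code.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

def driver_matches(registered_key, camera_frame_key,
                   bucket="vehicle-demo-faces", threshold=90.0):
    """Return True if the camera frame matches the registered driver photo."""
    response = rekognition.compare_faces(
        SourceImage={"S3Object": {"Bucket": bucket, "Name": registered_key}},
        TargetImage={"S3Object": {"Bucket": bucket, "Name": camera_frame_key}},
        SimilarityThreshold=threshold,
    )
    return any(match["Similarity"] >= threshold for match in response["FaceMatches"])
```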

Connected Vehicle Solution – This demo shows how a connected vehicle can combine local and cloud intelligence, using edge computing and machine learning at the edge. It handles intermittent connections and uses AWS DeepLens to train a model that responds to distracted drivers. Here’s the overall architecture, as described in our Connected Vehicle Solution:

Digital Content Delivery – This demo will show how a customer uses a web-based 3D configurator to build and personalize their vehicle. It will also show high-resolution (4K) 3D imagery and an optional immersive AR/VR experience, both designed for use within a dealership.

Autonomous Driving – This demo will showcase the AWS services that can be used to build autonomous vehicles. There’s a 1/16th scale model vehicle powered and driven by Greengrass and an overview of a new AWS Autonomous Toolkit. As part of the demo, attendees drive the car, training a model via Amazon SageMaker for subsequent on-board inferencing, powered by Greengrass ML Inferencing.

To speak to one of my colleagues or to set up a time to see the demos, check out the Visit AWS at CES 2018 page.

Some Resources

If you are interested in this topic and want to learn more, the AWS for Automotive page is a great starting point, with discussions on connected vehicles & mobility, autonomous vehicle development, and digital customer engagement.

When you are ready to start building a connected vehicle, the AWS Connected Vehicle Solution contains a reference architecture that combines local computing, sophisticated event rules, and cloud-based data processing and storage. You can use this solution to accelerate your own connected vehicle projects.

Happy 2018! I am looking forward to getting back to my usual routine, working with our teams to learn about their upcoming launches and then writing blog posts to bring the news to you. Right now I am still catching up on a few launches and announcements from late 2017.

First on the list for today is our most recent round of new cities for AWS Direct Connect. AWS customers all over the world use Direct Connect to create dedicated network connections from their premises to AWS in order to reduce their network costs, increase throughput, and pursue a more consistent network experience.

We added ten new locations to our Direct Connect roster in December, all of which offer both 1 Gbps and 10 Gbps connectivity, along with partner-supplied options for speeds below 1 Gbps. Here are the newest locations, along with the data centers and associated AWS Regions:

You can use these new locations in conjunction with the AWS Direct Connect Gateway to set up connectivity that spans Virtual Private Clouds (VPCs) spread across multiple AWS Regions (this does not apply to the AWS Regions in China).
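If you script this, the gateway side amounts to a couple of API calls. Here is a minimal boto3 sketch; the ASN and virtual private gateway ID are placeholders, and you would repeat the association for each VPC you want to reach.

```python
# Minimal sketch: create a Direct Connect gateway and associate it with a VPC's
# virtual private gateway, so one connection can reach VPCs in multiple Regions.
# The ASN and virtual private gateway ID below are placeholders.
import boto3

dx = boto3.client("directconnect", region_name="us-east-1")

gateway = dx.create_direct_connect_gateway(
    directConnectGatewayName="global-dx-gateway",
    amazonSideAsn=64512,
)
gateway_id = gateway["directConnectGateway"]["directConnectGatewayId"]

# Associate the gateway with the virtual private gateway attached to a VPC.
# Repeat for VPCs in other Regions (this does not apply to the AWS China Regions).
dx.create_direct_connect_gateway_association(
    directConnectGatewayId=gateway_id,
    virtualGatewayId="vgw-0a1b2c3d4e5f67890",
)
```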

If you are interested in putting Direct Connect to use, be sure to check out our ever-growing list of Direct Connect Partners.

We recently launched AWS Architecture Monthly, a new subscription service on Kindle that will push a selection of the best content around cloud architecture from AWS, with a few pointers to other content you might also enjoy.

From building a simple website to crafting an AI-based chat bot, the choices of technologies and the best practices in how to apply them are constantly evolving. Our goal is to supply you each month with a broad selection of the best new tech content from AWS — from deep-dive tutorials to industry-trend articles.

At this time, Architecture Monthly annual subscriptions are only available in France (new), the US, the UK, and Germany. As more countries become available, we’ll update you here on the blog. For Amazon.com countries not listed above, we are offering single-issue downloads — also accessible from our landing page. The content is the same as in the subscription but requires individual-issue downloads.

FAQ

Do I have to submit my credit card information for a free subscription? While you do have to submit your card information at this time (as you would for a free book in the Kindle store), it won’t be charged. This will remain a free, annual subscription and includes all 10 issues for the year.

Why isn’t the subscription available everywhere? As new countries get added to Kindle Newsstand, we’ll ensure we add them for Architecture Monthly. This month we added France but anticipate it will take some time for the new service to move into additional markets.

What countries are included in the Amazon.com list where the issues can be downloaded? Andorra, Australia, Austria, Belgium, Brazil, Canada, Gibraltar, Guernsey, India, Ireland, Isle of Man, Japan, Jersey, Liechtenstein, Luxembourg, Mexico, Monaco, Netherlands, New Zealand, San Marino, Spain, Switzerland, Vatican City

If someone wants to obtain the latest movies for free, all they need to do is head over to the nearest torrent or streaming portal, press a few buttons, and the content appears in a matter of seconds or minutes, depending on their choice.

Indeed, for those seeking mainstream content DRM-free, this is the only way to obtain it, since studios generally don’t make their content available in this fashion. But we know an establishment that does, on a grand scale.

University College London is the third largest university in the UK. According to accounts (pdf) published this summer, it has revenues of more than £1.32 billion. Somewhat surprisingly, this educational behemoth also has a sensational multimedia trick up its considerable sleeve.

The university’s website, located at UCL.ac.uk, is a polished affair and provides all the information anyone could need. However, until one browses to the Self-Access Centre, the full glory of the platform remains largely hidden.

Located at resources.clie.ucl.ac.uk/home/sac/english/films, it looks not unlike Netflix, or indeed any one of thousands of pirate streaming sites around today. However, it appears to be intended for university and educational use only.

UCL’s Self-Access Centre

“Welcome to the Self-Access Centre materials database. Here you can find out about the English materials we have in the SAC and explore our online materials,” the site reads.

“They were designed to help you improve your English skills. Most of the video materials, including films and documentaries, are now available to be watched online. Log on with your UCL id and password to watch them!”

According to a university video tutorial, all content on the SAC can be viewed on campus or from home, as long as a proper login and password is entered. The material is provided for educational purposes and when viewed through the portal, is accompanied by questions, notes, and various exercises.

Trouble is, the entire system is open to the wider Internet, with no logins or passwords required.

A sample of the movies on offer for direct download

The above image doesn’t even begin to scratch the surface. In one directory alone, TorrentFreak counted more than 700 English language movies. In another, more than 600 documentaries including all episodes of the BBC’s Blue Planet II. World Cinema produced close to 90 results, with hundreds of titles voiced in languages from Arabic to Japanese to Welsh.

Links can be pasted into VLC and streamed direct

Quite how long this massive trove of films and TV shows has been open to the public isn’t clear but a simple Google search reveals not only the content itself, but also links to movies and other material on sites in the Middle East and social networks in Russia.

Some of them date back to at least 2016 so it’s probably safe to assume that untold terabytes of data have already been liberated from the university’s servers for the pleasure of the public.

So Don Jr. tweets the following, which is an excellent troll. So I thought I’d bite. The reason is that I just got through debunking Democrat claims about NetNeutrality, so it seems like a good time to balance things out and debunk Trump nonsense. The issue here is not which side is right. The issue here is whether you stand for truth, or whether you’ll seize any factoid that appears to support your side, regardless of its truthfulness. The ACLU obviously chose falsehoods, as I documented. In the following tweet, Don Jr. does the same.

It’s a preview of the hyperpartisan debates you are likely to have across the dinner table tomorrow, with each side trying to outdo the other in the falsehoods they’ll claim.

Stock markets at all time highs
Lowest jobless claims since 73
6 Trillion added to economy since Election
1.5M fewer people on food stamps
Consumer confidence through roof
Lowest Unemployment rate in 17 years #maga

What we see in these numbers is a steady trend since the Great Recession, with no evidence in the graphs that Trump has influenced them one way or the other.

Stock markets at all time highs

This is true, but it’s obviously not due to Trump. The stock markets have been steadily rising since the Great Recession. Trump has done nothing substantive to change the market trajectory, and he hasn’t inspired the market to change its direction.

To be fair to Don Jr., we’ve all been crediting (or blaming) presidents for changes in the stock market despite the fact they have almost no influence over it. Presidents don’t run the economy, it’s an inappropriate conceit. The most influence they’ve had is in harming it.

Lowest jobless claims since 73

Again, let’s graph this:

As we can see, jobless claims have been on a smooth downward trajectory since the Great Recession. It’s difficult to see here how President Trump has influenced these numbers.

6 Trillion added to the economy

What he’s referring to is that assets have risen in value, like the stock market, homes, gold, and even Bitcoin.

But this is the well-known fallacy of Mercantilism: the belief that the “economy” is measured by the value of its assets. This was debunked by Adam Smith in his book “The Wealth of Nations”, where he showed instead that the “economy” is measured by how much it produces (GDP – Gross Domestic Product), not by its assets.

GDP has grown at 3.0%, which is pretty good compared to the long term trend, and is better than Europe or Japan (though not as good as China). But Trump doesn’t deserve any credit for this — today’s rise in GDP is the result of stuff that happened years ago.

Assets have risen by $6 trillion, but that’s not a good thing. After all, when you sell your home for more money, the buyer has to pay more. So one person is better off and one is worse off, so the net effect is zero.

Actually, such an asset price increase is a worrisome indicator — we are entering bubble territory. It’s the result of loose monetary policy, low interest rates, and the “quantitative easing” designed under the Obama administration to stimulate the economy. That’s why all assets are rising in value. Normally, a rise in one asset means a fall in another, like selling gold to pay for houses. But because of loose monetary policy, all assets are increasing in price. The amazing rise in Bitcoin over the last year is as much a result of this bubble growing in all assets as it is of an exuberant belief in Bitcoin.

When this bubble collapses, which may happen during Trump’s term, it’ll really be the Obama administration who is to blame. I mean, if Trump is willing to take credit for the asset price bubble now, I’m willing to give it to him, as long as he accepts the blame when it crashes.

1.5 million fewer people on food stamps

As you’d expect, I’m going to debunk this with a graph: the numbers have been falling since the Great Recession. Indeed, in the comparable period under Obama, 1.9 million people got off food stamps, so Trump’s performance is slightly behind Obama’s rather than ahead of it. Of course, neither president is really responsible.

Consumer confidence through the roof

Again we find nothing in the graph that suggests President Trump is responsible for any change — it’s been improving steadily since the Great Recession.

One thing to note is that, technically, it’s not “through the roof” — it’s still quite a bit below the roof set during the dot-com era.

Lowest Unemployment rate in 17 years

Again, let’s simply graph it over time and look for Trump’s contribution. As we can see, there doesn’t appear to be anything special Trump has done — unemployment has been steadily improving since the Great Recession.

But here’s the thing, the “unemployment rate” only measures those looking for work, not those who have given up. The number that concerns people more is the “labor force participation rate”. The Great Recession kicked a lot of workers out of the economy.

Mostly this is because Baby Boomers are now retiring and leaving the workforce, and some have chosen to retire early rather than look for another job. But there are still other problems in our economy that contribute to this. President Trump has done nothing in particular to solve these problems.

Conclusion

As we see, Don Jr’s tweet is a troll. When we look at the graphs of these indicators going back to the Great Recession, we don’t see how President Trump has influenced anything. The improvements this year are in line with the improvements last year, which are in turn in line with the improvements of the previous year.

To be fair, all parties credit their President with improvements during their term. President Obama’s supporters did the same thing. But at least right now, with these numbers, we can see that there’s no merit to anything in Don Jr’s tweet.

The hyperpartisan rancor in this country is because neither side cares about the facts. We should care. We should care that these numbers suck, even if we are Republicans. Conversely, we should care that those NetNeutrality claims by Democrats suck, even if we are Democrats.

Get ready for takeoff!

We made sure that this year’s re:Invent is chock-full of containers: there are over 40 sessions! New to containers? No problem, we have several introductory sessions for you to dip your toes. Been using containers for years and know the ins and outs? Don’t miss our technical deep-dives and interactive chalk talks led by container experts.

If you can’t make it to Las Vegas, you can catch the keynotes and session recaps from our livestream and on Twitch.

Session types

Not everyone learns the same way, so we have multiple types of breakout content:

Birds of a Feather – An interactive discussion with industry leaders about containers on AWS.

Breakout sessions – 60-minute presentations about building on AWS. Sessions are delivered by both AWS experts and customers and span all content levels.

Workshops – 2.5-hour, hands-on sessions that teach how to build on AWS. AWS credits are provided. Bring a laptop, and have an active AWS account.

Chalk Talks – 1-hour, highly interactive sessions with a smaller audience. They begin with a short lecture delivered by an AWS expert, followed by a discussion with the audience.

Session levels

Whether you’re new to containers or you’ve been using them for years, you’ll find useful information at every level.

Introductory Sessions are focused on providing an overview of AWS services and features, with the assumption that attendees are new to the topic.

Advanced Sessions dive deeper into the selected topic. Presenters assume that the audience has some familiarity with the topic, but may or may not have direct experience implementing a similar solution.

Expert Sessions are for attendees who are deeply familiar with the topic, have implemented a solution on their own already, and are comfortable with how the technology works across multiple services, architectures, and implementations.

Session locations

All container sessions are located in the Aria Resort.

MONDAY 11/27

Breakout sessions

Level 200 (Introductory)

CON202 – Getting Started with Docker and Amazon ECS By packaging software into standardized units, Docker gives code everything it needs to run, ensuring consistency from your laptop all the way into production. But once you have your code ready to ship, how do you run and scale it in the cloud? In this session, you become comfortable running containerized services in production using Amazon ECS. We cover container deployment, cluster management, service auto-scaling, service discovery, secrets management, logging, monitoring, security, and other core concepts. We also cover integrated AWS services and supplementary services that you can take advantage of to run and scale container-based services in the cloud.

Chalk talks

Level 200 (Introductory)

CON211 – Reducing your Compute Footprint with Containers and Amazon ECS Tomas Riha, platform architect for Volvo, shows how Volvo transitioned its WirelessCar platform from using Amazon EC2 virtual machines to containers running on Amazon ECS, significantly reducing cost. Tomas dives deep into the architecture that Volvo used to achieve the migration in under four months, including Amazon ECS, Amazon ECR, Elastic Load Balancing, and AWS CloudFormation.

CON212 – Anomaly Detection Using Amazon ECS, AWS Lambda, and Amazon EMR Learn about the architecture that Cisco CloudLock uses to enable automated security and compliance checks throughout the entire development lifecycle, from the first line of code through runtime. It includes integration with IAM roles, Amazon VPC, and AWS KMS.

Level 400 (Expert)

CON410 – Advanced CICD with Amazon ECS Control Plane Mohit Gupta, product and engineering lead for Clever, demonstrates how to extend the Amazon ECS control plane to optimize management of container deployments and how the control plane can be broadly applied to take advantage of new AWS services. This includes ark—an AWS CLI-based deployment to Amazon ECS, Dapple—a Slack-based automation system for deployments and notifications, and Kayvee—log and event routing libraries based on Amazon Kinesis.

Workshops

Level 200 (Introductory)

CON209 – Interstella 8888: Learn How to Use Docker on AWS Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to get hands-on experience with Docker as you containerize Interstella 8888’s aging monolithic application and deploy it using Amazon ECS.

CON213 – Hands-on Deployment of Kubernetes on AWS In this workshop, attendees get hands-on experience using Kubernetes and Kops (Kubernetes Operations), as described in our recent blog post. Attendees learn how to provision a cluster, assign role-based permissions and security, and launch a container. If you’re interested in learning best practices for running Kubernetes on AWS, don’t miss this workshop.

TUESDAY 11/28

Breakout Sessions

Level 200 (Introductory)

CON206 – Docker on AWS In this session, Docker Technical Staff Member Patrick Chanezon discusses how Finnish Rail, the national train system for Finland, is using Docker on Amazon Web Services to modernize their customer facing applications, from ticket sales to reservations. Patrick also shares the state of Docker development and adoption on AWS, including explaining the opportunities and implications of efforts such as Project Moby, Docker EE, and how developers can use and contribute to Docker projects.

CON208 – Building Microservices on AWS Increasingly, organizations are turning to microservices to help them empower autonomous teams, letting them innovate and ship software faster than ever before. But implementing a microservices architecture comes with a number of new challenges that need to be dealt with. Chief among these is finding an appropriate platform to help manage a growing number of independently deployable services. In this session, Sam Newman, author of Building Microservices and a renowned expert in microservices strategy, discusses strategies for building scalable and robust microservices architectures. He also tells you how to choose the right platform for building microservices, and about common challenges and mistakes organizations make when they move to microservices architectures.

Level 300 (Advanced)

CON302 – Building a CICD Pipeline for Containers on AWS Containers can make it easier to scale applications in the cloud, but how do you set up your CICD workflow to automatically test and deploy code to containerized apps? In this session, we explore how developers can build effective CICD workflows to manage their containerized code deployments on AWS.

Ajit Zadgaonkar, Director of Engineering and Operations at Edmunds, walks through best practices for CICD architectures used by his team to deploy containers. We also deep dive into topics such as how to create an accessible CICD platform and architect for safe blue/green deployments.

CON307 – Building Effective Container Images Sick of getting paged at 2am and wondering “where did all my disk space go?” New Docker users often start with a stock image in order to get up and running quickly, but this can cause problems as your application matures and scales. Creating efficient container images is important to maximize resources, and deliver critical security benefits.

In this session, AWS Sr. Technical Evangelist Abby Fuller covers how to create effective images to run containers in production. This includes an in-depth discussion of how Docker image layers work, things you should think about when creating your images, working with Amazon ECR, and mise-en-place for installing dependencies. Prakash Janakiraman, Co-Founder and Chief Architect at Nextdoor, discusses high-level and language-specific best practices for building images and how Nextdoor uses these practices to successfully scale their containerized services with a small team.

CON309 – Containerized Machine Learning on AWS Image recognition is a field of deep learning that uses neural networks to recognize the subject and traits for a given image. In Japan, Cookpad uses Amazon ECS to run an image recognition platform on clusters of GPU-enabled EC2 instances. In this session, hear from Cookpad about the challenges they faced building and scaling this advanced, user-friendly service to ensure high-availability and low-latency for tens of millions of users.

CON320 – Monitoring, Logging, and Debugging for Containerized Services As containers become more embedded in the platform, debugging tools, traces, and logs become increasingly important. Nare Hayrapetyan, Senior Software Engineer, and Calvin French-Owen, Senior Technical Officer at Segment, discuss the principles of monitoring and debugging containers and the tools Segment has implemented and built for logging, alerting, metric collection, and debugging of containerized services running on Amazon ECS.

Chalk Talks

Level 300 (Advanced)

CON314 – Automating Zero-Downtime Production Cluster Upgrades for Amazon ECS Containers make it easy to deploy new code into production to update the functionality of a service, but what happens when you need to update the Amazon EC2 compute instances that your containers are running on? In this talk, we’ll deep dive into how to upgrade the Amazon EC2 infrastructure underlying a live production Amazon ECS cluster without affecting service availability. Matt Callanan, Engineering Manager at Expedia, walks through Expedia’s “PRISM” project, which safely relocates hundreds of tasks onto new Amazon EC2 instances with zero downtime to applications.

CON322 – Maximizing Amazon ECS for Large-Scale Workloads Head of Mobfox DevOps, David Spitzer, shows how Mobfox used Docker and Amazon ECS to scale the Mobfox services and development teams to achieve low-latency networking and automatic scaling. This session covers Mobfox’s ecosystem architecture, comparing where it stood in 2015 with where it is today, the challenges Mobfox faced in growing their platform, and how they overcame them.

CON323 – Microservices Architectures for the Enterprise Salva Jung, Principal Engineer for Samsung Mobile, shares how Samsung Connect is architected as microservices running on Amazon ECS to securely, stably, and efficiently handle requests from millions of mobile and IoT devices around the world.

CON324 – Windows Containers on Amazon ECS Docker containers are commonly regarded as powerful and portable runtime environments for Linux code, but Docker also offers API and toolchain support for running Windows Server in containers. In this talk, we discuss the various options for running Windows-based applications in containers on AWS.

CON326 – Remote Sensing and Image Processing on AWS Learn how Encirca services by DuPont Pioneer uses Amazon ECS powered by GPU instances and Amazon EC2 Spot Instances to run proprietary image-processing algorithms against satellite imagery. Mark Lanning and Ethan Harstad, engineers at DuPont Pioneer, show how this architecture has allowed them to process satellite imagery multiple times a day for each agricultural field in the United States in order to identify crop health changes.

Workshops

Level 300 (Advanced)

CON317 – Advanced Container Management at Catsndogs.lol Catsndogs.lol is a (fictional) company that needs help deploying and scaling its container-based application. During this workshop, attendees join the new DevOps team at CatsnDogs.lol, and help the company to manage their applications using Amazon ECS, and help release new features to make our customers happier than ever. Attendees get hands-on with service and container-instance auto-scaling, spot-fleet integration, container placement strategies, service discovery, secrets management with AWS Systems Manager Parameter Store, time-based and event-based scheduling, and automated deployment pipelines. If you are a developer interested in learning more about how Amazon ECS can accelerate your application development and deployment workflows, or if you are a systems administrator or DevOps person interested in understanding how Amazon ECS can simplify the operational model associated with running containers at scale, then this workshop is for you. You should have basic familiarity with Amazon ECS, Amazon EC2, and IAM.

Additional requirements:

The AWS CLI or AWS Tools for PowerShell installed

An AWS account with administrative permissions (including the ability to create IAM roles and policies) created at least 24 hours in advance.

WEDNESDAY 11/29

Birds of a Feather (BoF)

CON01 – Birds of a Feather: Containers and Open Source at AWS Cloud native architectures take advantage of on-demand delivery, global deployment, elasticity, and higher-level services to enable developer productivity and business agility. Open source is a core part of making cloud native possible for everyone. In this session, we welcome thought leaders from the CNCF, Docker, and AWS to discuss the cloud’s direction for growth and enablement of the open source community. We also discuss how AWS is integrating open source code into its container services and its contributions to open source projects.

Breakout Sessions

Level 300 (Advanced)

CON308 – Mastering Kubernetes on AWS Much progress has been made on how to bootstrap a cluster since Kubernetes’ first commit, and it now takes only a matter of minutes to go from zero to a running cluster on Amazon Web Services. However, evolving a simple Kubernetes architecture to be ready for production in a large enterprise can quickly become overwhelming with options for configuration and customization.

In this session, Arun Gupta, Open Source Strategist for AWS and Raffaele Di Fazio, software engineer at leading European fashion platform Zalando, show the common practices for running Kubernetes on AWS and share insights from experience in operating tens of Kubernetes clusters in production on AWS. We cover options and recommendations on how to install and manage clusters, configure high availability, perform rolling upgrades and handle disaster recovery, as well as continuous integration and deployment of applications, logging, and security.

CON310 – Moving to Containers: Building with Docker and Amazon ECS If you’ve ever considered moving part of your application stack to containers, don’t miss this session. We cover best practices for containerizing your code, implementing automated service scaling and monitoring, and setting up automated CI/CD pipelines with fail-safe deployments. Manjeeva Silva and Thilina Gunasinghe show how McDonald’s implemented their home delivery platform in four months using Docker containers and Amazon ECS to serve tens of thousands of customers.

Level 400 (Expert)

CON402 – Advanced Patterns in Microservices Implementation with Amazon ECS Scaling a microservice-based infrastructure can be challenging in terms of both technical implementation and developer workflow. In this talk, AWS Solutions Architect Pierre Steckmeyer is joined by Will McCutchen, Architect at BuzzFeed, to discuss Amazon ECS as a platform for building a robust infrastructure for microservices. We look at the key attributes of microservice architectures and how Amazon ECS supports these requirements in production, from configuration to sophisticated workload scheduling to networking capabilities to resource optimization. We also examine what it takes to build an end-to-end platform on top of the wider AWS ecosystem, and what it’s like to migrate a large engineering organization from a monolithic approach to microservices.

CON404 – Deep Dive into Container Scheduling with Amazon ECS As your application’s infrastructure grows and scales, well-managed container scheduling is critical to ensuring high availability and resource optimization. In this session, we deep dive into the challenges and opportunities around container scheduling, as well as the different tools available within Amazon ECS and AWS to carry out efficient container scheduling. We discuss patterns for container scheduling available with Amazon ECS, the Blox scheduling framework, and how you can customize and integrate third-party scheduler frameworks to manage container scheduling on Amazon ECS.

Chalk Talks

Level 300 (Advanced)

CON312 – Building a Selenium Fleet on the Cheap with Amazon ECS with Spot Fleet Roberto Rivera and Matthew Wedgwood, engineers at RetailMeNot, give a practical overview of setting up a fleet of Selenium nodes running on Amazon ECS with Spot Fleet. They discuss the challenges of running Selenium with high availability at minimum cost, using Amazon ECS container introspection to connect the Selenium Hub with its nodes.

CON315 – Virtually There: Building a Render Farm with Amazon ECS Learn how 8i Corp scales its multi-tenanted, volumetric render farm up to thousands of instances using AWS, Docker, and an API-driven infrastructure. This render farm enables them to turn the video footage from an array of synchronized cameras into a photo-realistic hologram capable of playback on a range of devices, from mobile phones to high-end head mounted displays. Join Owen Evans, VP of Engineering for 8i, as they dive deep into how 8i’s rendering infrastructure is built and maintained by just a handful of people and powered by Amazon ECS.

CON325 – Developing Microservices – from Your Laptop to the Cloud Wesley Chow, Staff Engineer at Adroll, shows how his team extends Amazon ECS by enabling local development capabilities. Hologram, Adroll’s local development program, brings the capabilities of the Amazon EC2 instance metadata service to non-EC2 hosts, so that developers can run the same software on local machines with the same credentials source as in production.

CON327 – Patterns and Considerations for Service Discovery Roven Drabo, head of cloud operations at Kaplan Test Prep, illustrates Kaplan’s complete container automation solution using Amazon ECS along with how his team uses NGINX and HashiCorp Consul to provide an automated approach to service discovery and container provisioning.

CON328 – Building a Development Platform on Amazon ECS Quinton Anderson, Head of Engineering for Commonwealth Bank of Australia, walks through how they migrated their internal development and deployment platform from Mesos/Marathon to Amazon ECS. The platform uses a custom DSL to abstract a layered application architecture, in a way that makes it easy to plug or replace new implementations into each layer in the stack.

Workshops

Level 300 (Advanced)

CON318 – Interstella 8888: Monolith to Microservices with Amazon ECS Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to get hands-on experience deploying Docker containers as you break Interstella 8888’s aging monolithic application into containerized microservices. Using Amazon ECS and an Application Load Balancer, you create API-based microservices and deploy them leveraging integrations with other AWS services.

CON332 – Build a Java Spring Application on Amazon ECS This workshop teaches you how to lift and shift existing Spring and Spring Cloud applications onto the AWS platform. Learn how to build a Spring application container, understand bootstrap secrets, push container images to Amazon ECR, and deploy the application to Amazon ECS. Then, learn how to configure the deployment for production.

THURSDAY 11/30

Breakout Sessions

Level 200 (Introductory)

CON201 – Containers on AWS – State of the Union Just over four years after the first public release of Docker, and three years to the day after the launch of Amazon ECS, the use of containers has surged to run a significant percentage of production workloads at startups and enterprise organizations. Join Deepak Singh, General Manager of Amazon Container Services, as he covers the state of containerized application development and deployment trends, new container capabilities on AWS that are available now, options for running containerized applications on AWS, and how AWS customers successfully run container workloads in production.

Level 300 (Advanced)

CON304 – Batch Processing with Containers on AWS Batch processing is useful to analyze large amounts of data. But configuring and scaling a cluster of virtual machines to process complex batch jobs can be difficult. In this talk, we show how to use containers on AWS for batch processing jobs that can scale quickly and cost-effectively. We also discuss AWS Batch, our fully managed batch-processing service. You also hear from GoPro and Here about how they use AWS to run batch processing jobs at scale including best practices for ensuring efficient scheduling, fine-grained monitoring, compute resource automatic scaling, and security for your batch jobs.

Level 400 (Expert)

CON406 – Architecting Container Infrastructure for Security and Compliance While organizations gain agility and scalability when they migrate to containers and microservices, they also benefit from compliance and security, advantages that are often overlooked. In this session, Kelvin Zhu, lead software engineer at Okta, joins Mitch Beaumont, enterprise solutions architect at AWS, to discuss security best practices for containerized infrastructure. Learn how Okta built their development workflow with an emphasis on security through testing and automation. Dive deep into how containers enable automated security and compliance checks throughout the development lifecycle. Also understand best practices for implementing AWS security and secrets management services for any containerized service architecture.

Chalk Talks

Level 300 (Advanced)

CON329 – Full Software Lifecycle Management for Containers Running on Amazon ECS Learn how The Washington Post uses Amazon ECS to run Arc Publishing, a digital journalism platform that powers The Washington Post and a growing number of major media websites. Amazon ECS enabled The Washington Post to containerize their existing microservices architecture, avoiding a complete rewrite that would have delayed the platform’s launch by several years. In this session, Jason Bartz, Technical Architect at The Washington Post, discusses the platform’s architecture. He addresses the challenges of optimizing Arc Publishing’s workload, and managing the application lifecycle to support 2,000 containers running on more than 50 Amazon ECS clusters.

CON330 – Running Containerized HIPAA Workloads on AWS Nihar Pasala, Engineer at Aetion, discusses the Aetion Evidence Platform, a system for generating the real-world evidence used by healthcare decision makers to implement value-based care. This session discusses the architecture Aetion uses to run HIPAA workloads using containers on Amazon ECS, best practices, and learnings.

Level 400 (Expert)

CON408 – Building a Machine Learning Platform Using Containers on AWS DeepLearni.ng develops and implements machine learning models for complex enterprise applications. In this session, Thomas Rogers, Engineer at DeepLearni.ng, discusses how they worked with Scotiabank to leverage Amazon ECS, Amazon ECR, Docker, GPU-accelerated Amazon EC2 instances, and TensorFlow to develop a retail risk model that helps manage payment collections for millions of Canadian credit card customers.

Workshops

Level 300 (Advanced)

CON319 – Interstella 8888: CICD for Containers on AWS Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. Join this workshop to learn how to set up a CI/CD pipeline for containerized microservices. You get hands-on experience deploying Docker container images using Amazon ECS, AWS CloudFormation, AWS CodeBuild, and AWS CodePipeline, automating everything from code check-in to production.

FRIDAY 12/1

Breakout Sessions

Level 400 (Expert)

CON405 – Moving to Amazon ECS – the Not-So-Obvious Benefits If you ask 10 teams why they migrated to containers, you will likely get answers like ‘developer productivity’, ‘cost reduction’, and ‘faster scaling’. But teams often find there are several other ‘hidden’ benefits to using containers for their services. In this talk, Franziska Schmidt, Platform Engineer at Mapbox, and Yaniv Donenfeld from AWS will discuss the obvious, and not so obvious, benefits of moving to a containerized architecture. These include using Docker and Amazon ECS to achieve shared libraries for dev teams, separating private infrastructure from shareable code, and making it easier for non-ops engineers to run services.

Chalk Talks

Level 300 (Advanced)

CON331 – Deploying a Regulated Payments Application on Amazon ECS Travelex discusses how they built an FCA-compliant international payments service using a microservices architecture on AWS. This chalk talk covers the challenges of designing and operating an Amazon ECS-based PaaS in a regulated environment using a DevOps model.

Workshops

Level 400 (Expert)

CON407 – Interstella 8888: Advanced Microservice Operations Interstella 8888 is an intergalactic trading company that deals in rare resources, but their antiquated monolithic logistics systems are causing the business to lose money. In this workshop, you help Interstella 8888 build a modern microservices-based logistics system to save the company from financial ruin. We give you the hands-on experience you need to run microservices in the real world. This includes implementing advanced container scheduling and scaling to deal with variable service requests, implementing a service mesh, issue tracing with AWS X-Ray, container and instance-level logging with Amazon CloudWatch, and load testing.

Know before you go

Want to brush up on your container knowledge before re:Invent? Here are some helpful resources to get started:

While most cracking groups operate under a veil of secrecy, China-based 3DM is not shy to come out in public.

The group’s leader, known as Bird Sister, has commented on various gaming and piracy related issues in the past.

She also spoke out when her own group was sued by the Japanese game manufacturer Koei Tecmo last year. The company accused 3DM of pirating several of its titles, including Romance of the Three Kingdoms.

However, Bird Sister instead wondered why the company should be able to profit from a work inspired by a 3rd-century novel from China.

“…why does a Japanese company, Koei have the copyright of this game when the game is obviously a derivation from the book “Romance of the Three Kingdoms” written by Chen Shou. I think Chinese gaming companies should try taking back the copyright,” she said.

Bird Sister

The novel in question has long since been in the public domain so there’s nothing stopping Koei Tecmo from using it, as Kotaku points out. The game, however, is a copyrighted work and 3DM’s actions were seen as clear copyright infringement by a Chinese court.

In a press release, Koei Tecmo announces that it has won its lawsuit against the cracking group.

The court ordered 3DM to stop distributing the infringing games and awarded a total of 1.62 million Yuan ($245,000) in piracy damages and legal fees.

While computer games are cracked and pirated on a daily basis, those responsible for it are rarely held accountable. This makes the case against 3DM rather unique. And it may not be the last if it’s up to the game manufacturer.

“We will continue to respond rigorously to infringements of our copyrights and trademark rights, both in domestic and overseas markets, while also developing satisfying games that many users can enjoy,” said the company, commenting on the ruling.

While the lawsuit may help to steer the cracking group away from pirating Koei Tecmo games, it can’t undo any earlier releases. Court order or not, past 3DM releases, including Romance of the Three Kingdoms titles, are still widely available through third-party sites.

The AWS Community Heroes program helps shine a spotlight on some of the innovative work being done by rockstar AWS developers around the globe. Marrying cloud expertise with a passion for community building and education, these heroes share their time and knowledge across social media and through in-person events. Heroes also actively help drive community-led tracks at conferences. At this year’s re:Invent, many Heroes will be speaking during the Monday Community Day track.

This November, we are thrilled to have four Heroes joining our network of cloud innovators. Without further ado, meet our newest AWS Community Heroes!

At OSAM, Anh and his enthusiastic team have helped many companies, from SMBs to Enterprises, move to the cloud with AWS. They offer a wide range of services, including migration, consultation, architecture, and solution design on AWS. Anh’s vision for OSAM goes beyond being a cloud service provider: the company will take part in building a complete AWS ecosystem in Vietnam, where other companies are encouraged to become AWS partners through training and collaboration activities.

In 2016, Anh founded the AWS Vietnam User Group as a channel to share knowledge and hands-on experience among cloud practitioners. Since then, the community has reached more than 4,800 members and is still expanding. The group holds monthly meetups, connects many SMEs to AWS experts, and provides real-time, free-of-charge consultancy to startups. In August 2017, Anh joined as lead content creator of a program called “Cloud Computing Lectures for Universities” which includes translating AWS documentation & news into Vietnamese, providing students with fundamental, up-to-date knowledge of AWS cloud computing, and supporting students’ career paths.

Thorsten Höger is CEO and Cloud Consultant at Taimos, where he advises customers on how to use AWS. As a developer, he focuses on improving development processes and automating everything to build efficient deployment pipelines for customers of all sizes.

Before becoming self-employed, Thorsten worked as a developer and CTO of Germany’s first private bank running on AWS. With his colleagues, he migrated the core banking system to the AWS platform in 2013. Since then, he has organized the AWS user group in Stuttgart and is a frequent speaker at Meetups, BarCamps, and other community events.

As a supporter of open source software, Thorsten maintains or contributes to several projects on GitHub, such as test frameworks for AWS Lambda and Amazon Alexa, and developer tools for CloudFormation. He is also the maintainer of the Jenkins AWS Pipeline plugin.

Yu Zhang (Becky Zhang) is COO of BootDev, which focuses on Big Data solutions on AWS and high-concurrency web architecture. Before she helped run BootDev, she worked at Yubis IT Solutions as an operations manager.

Becky plays a key role in the AWS User Group Shanghai (AWSUGSH), regularly organizing AWS UG events including AWS Tech Meetups and happy hours, gathering AWS talent together to share the latest technology and AWS services. As a woman in the technology industry, Becky is keen on promoting Women in Tech and encourages more women to get involved in the community.

Becky also connects the China AWS User Group with user groups in other regions, including Korea, Japan, and Thailand. She was invited as a panelist at AWS re:Invent 2016 and spoke at the Seoul AWS Summit this April to introduce AWS User Group Shanghai and communicate with other AWS User Groups around the world.

Besides events, Becky also promotes the Shanghai AWS User Group by posting AWS-related tech articles, event announcements, and event reports to Weibo, Twitter, Meetup.com, and WeChat (where the group’s official account now has over 2,000 followers).

Nilesh Vaghela is the founder of ElectroMech Corporation, an AWS Cloud and open source focused company (the company started with an open source motto). Nilesh has been very active in the Linux community since 1998. He started working with AWS Cloud technologies in 2013, and in 2014 he trained a dedicated cloud team and began offering full support for AWS cloud services as an AWS Standard Consulting Partner. He always works to establish and encourage cloud and open source communities.

He started the AWS Meetup community in Ahmedabad in 2014, and it has held 12 Meetups so far, focusing on various AWS technologies. The Meetup has quickly grown to include over 2,000 members. Nilesh also created a Facebook group for AWS enthusiasts in Ahmedabad, with over 1,500 members.

Apart from the AWS Meetup, Nilesh has delivered a number of seminars, workshops, and talks introducing AWS at various organizations, as well as at colleges and universities. He has also been active in working with startups, presenting overviews of AWS services and discussing how startups can benefit the most from using them.

Nine years ago I showed you how you could Distribute Your Content with Amazon CloudFront. We launched CloudFront in 2008 with 14 Points of Presence and have been expanding rapidly ever since. Today I am pleased to announce the opening of our 100th Point of Presence, the fifth one in Tokyo and the sixth in Japan. With 89 Edge Locations and 11 Regional Edge Caches, CloudFront now supports traffic generated by millions of viewers around the world.

23 Countries, 50 Cities, and Growing

Those 100 Points of Presence span the globe, with sites in 50 cities and 23 countries. In the past 12 months we have expanded the size of our network by about 58%, adding 37 Points of Presence, including nine in the following new cities:

Berlin, Germany

Minneapolis, Minnesota, USA

Prague, Czech Republic

Boston, Massachusetts, USA

Munich, Germany

Vienna, Austria

Kuala Lumpur, Malaysia

Philadelphia, Pennsylvania, USA

Zurich, Switzerland

We have even more in the works, including an Edge Location in the United Arab Emirates, currently planned for the first quarter of 2018.

Innovating for Our Customers

As I mentioned earlier, our network consists of a mix of Edge Locations and Regional Edge Caches. First announced at re:Invent 2016, the Regional Edge Caches sit between our Edge Locations and your origin servers, have even more memory than the Edge Locations, and allow us to store content close to the viewers for rapid delivery, all while reducing the load on the origin servers.

While locations are important, they are just a starting point. We continue to focus on security with the recent launch of our Security Policies feature and our announcement that CloudFront is a HIPAA-eligible service. We gave you more content-serving and content-generation options with the launch of Lambda@Edge, letting you run AWS Lambda functions close to your users.

We have also been working to accelerate the processing of cache invalidations and configuration changes. We now accept invalidations within milliseconds of the request and confirm that the request has been processed world-wide, typically within 60 seconds. This helps to ensure that your customers have access to fresh, timely content!
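If you want to try this yourself, requesting an invalidation is a single API call. Here is a minimal sketch using boto3; the distribution ID is a placeholder, and CallerReference just needs to be unique per request.

    import time
    import boto3

    cloudfront = boto3.client("cloudfront")
    cloudfront.create_invalidation(
        DistributionId="E2EXAMPLE",                      # placeholder distribution ID
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/index.html"]},
            "CallerReference": str(time.time()),         # must be unique per request
        },
    )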

If you cannot attend AWS re:Invent 2017 in person, you can still watch the two keynotes and Tuesday Night Live from wherever you are. We will live stream both keynotes with Andy Jassy, CEO of Amazon Web Services, and Werner Vogels, CTO of Amazon.com, as well as Tuesday Night Live with Peter DeSantis, VP of AWS Global Infrastructure. Note that the live streams will be in English only. The recordings will include captions for Japanese, Korean, and Simplified Chinese.

At some point, many of you will have become exasperated with your AI personal assistant for not understanding you due to your accent – or worse, your fantastic regional dialect! A vending machine from Coca-Cola Sweden turns this issue inside out: the Dialekt-o-maten rewards users with a free soft drink for speaking in a Swedish regional dialect.

Thirsty fans, along with journalists, were invited to try the Dialekt-o-maten at Stureplan in central Stockholm. Depending on how well they could pronounce phrases in assorted Swedish dialects, they were rewarded with an ice-cold Coke with that destination on the label.

The Dialekt-o-maten

The machine, which uses a Raspberry Pi, was set up in Stureplan Square in Stockholm. A person presses one of six buttons to choose the regional dialect they want to try out. They then hit ‘record’, and speak into the microphone. The recording is compared to a library of dialect samples, and, if it matches closely enough, voila! — the Dialekt-o-maten dispenses a soft drink for free.

Code for the Dialekt-o-maten

The team of developers used the dejavu Python library, as well as custom-written code which responded to new recordings. Carl-Anders Svedberg, one of the developers, said:

Testing the voices and fine-tuning the right level of difficulty for the users was quite tricky. And we really should have had more voice samples. Filtering out noise from the surroundings, like cars and music, was also a small hurdle.

While they wrote the initial software on macOS, the team transferred it to a Raspberry Pi so they could install the hardware inside the Dialekt-o-maten.
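The team’s own code isn’t published in the post, but the audio-matching half can be sketched with dejavu alone. This is a minimal, hypothetical example against the older worldveil/dejavu API (MySQL-backed config, FileRecognizer); the directory and file names are made up, and the confidence threshold is arbitrary.

    from dejavu import Dejavu
    from dejavu.recognize import FileRecognizer

    config = {
        "database": {
            "host": "127.0.0.1",
            "user": "dejavu",
            "passwd": "secret",
            "db": "dialekt",
        }
    }

    djv = Dejavu(config)
    # Fingerprint the reference recordings of each dialect once.
    djv.fingerprint_directory("dialect_samples", [".wav"])

    # Later, compare a visitor's attempt against the fingerprinted samples.
    match = djv.recognize(FileRecognizer, "attempt.wav")
    if match and match["confidence"] > 50:      # threshold chosen arbitrarily here
        print("Close enough, have a Coke:", match["song_name"])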

Regional dialects

Even though Sweden has only ten million inhabitants, there are more than 100 Swedish dialects. In some areas of Sweden, the local language even still resembles Old Norse. The Dialekt-o-maten recorded how well people spoke the six dialects it used. Apparently, the hardest one to imitate is spoken in Vadstena, and the easiest is spoken in Smögen.

Speech recognition with the Pi

Because of its audio input capabilities, the Raspberry Pi is very useful for building devices that use speech recognition software. One of our favourite projects in this vein is of course Allen Pan’s Real-Life Wizard Duel. We also think this pronunciation training machine by Japanese makers HomeMadeGarbage is really neat. Ideas from these projects and the Dialekt-o-maten could potentially be combined to make a fully fledged language-learning tool!

How about you? Have you used a Raspberry Pi to help you become multilingual? If so, do share your project with us in the comments or via social media.

While this hasn’t been a great summer at the box office, the makers of the film can’t complain as they’ve taken the top spot two weeks in a row. While this is reason for a small celebration, the fun didn’t last for long.

A few days ago several high-quality copies of the film started to appear on various pirate sites. While movie leaks happen every day, it’s very unusual that it happens just a few days after the theatrical release. In several countries including Australia, China, and Germany, it hasn’t even premiered yet.

Many pirates appear to be genuinely surprised by the early release as well, based on various comments. “August 18 was the premiere, how did you do this magic?” one downloader writes.

“OK, this was nothing short of perfection. 8 days post theatrical release… perfect 1080p clarity… no hardcoded subs… English translation AND full English subs… 5.1 audio. Does it get any better?” another commenter added.

The pirated copies of the movie are tagged as a “Web-DL” which means that they were ripped from an online streaming service. While the source is not revealed anywhere, the movie is currently available on Netflix in Japan, which makes it a likely candidate.

Screenshot of the leak

While the public often call for a simultaneous theatrical and Internet release, the current leak shows that this might come with a significant risk.

It’s clear that The Hitman’s Bodyguard production company Millennium Films is going to be outraged. The company has taken an aggressive stance against piracy in recent years. Among other things, it demanded automated cash settlements from alleged BitTorrent pirates and is also linked to various ‘copyright troll’ lawsuits.

Whether downloaders of The Hitman’s Bodyguard will be pursued as well has yet to be seen. For now, there is still plenty of interest from pirates. The movie was the most downloaded title on BitTorrent last week and is still doing well.

Using a Raspberry Pi, an Arduino, an Adafruit NeoPixel Ring and a servomotor, Japanese makers HomeMadeGarbage produced this Pronunciation Training Machine to help their parents distinguish ‘L’s and ‘R’s when speaking English.

How does the Pronunciation Training Machine work?

As you can see in the video above, the machine utilises the Google Cloud Speech API to recognise their parents’ pronunciation of the words ‘right’ and ‘light’. Correctly pronounce the former, and the servo-mounted arrow points to the right. Pronounce the latter, and the NeoPixel Ring illuminates because, well, you just said “light”.
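The build’s code isn’t shown in this post, but the recognition step might look roughly like this with the current google-cloud-speech client (the original project may well have used an older client or different audio settings); point_arrow_right and light_neopixel_ring are hypothetical stand-ins for the servo and NeoPixel code.

    from google.cloud import speech

    def check_pronunciation(wav_path):
        client = speech.SpeechClient()
        with open(wav_path, "rb") as f:
            audio = speech.RecognitionAudio(content=f.read())
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        )
        response = client.recognize(config=config, audio=audio)
        transcript = ""
        if response.results:
            transcript = response.results[0].alternatives[0].transcript.lower()
        if "right" in transcript:
            point_arrow_right()        # hypothetical servo helper
        elif "light" in transcript:
            light_neopixel_ring()      # hypothetical NeoPixel helper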

You could also use random.choice to switch on LEDs over certain images, and speech recognition to reward a correct answer: light up a picture of a cat, for example, and when the player says “cat”, they receive a ‘purr’ or a treat.
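As a rough sketch of that idea (pin numbers and picture names are arbitrary):

    import random
    from gpiozero import LED

    pictures = {"cat": LED(17), "dog": LED(27), "fish": LED(22)}

    target = random.choice(list(pictures))   # pick which picture to light up
    pictures[target].on()
    # ...then run the player's recording through your speech recogniser and
    # compare the transcript against `target` to decide whether to reward them.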

Obligatory kitten picture (image c/o somewhere on the internet!)

Raspberry Pi-based educational aids do not have to be elaborate builds. They can use components as simple as a servo and an LED, and still have the potential to make great improvements in people’s day-to-day lives.

Your own projects

If you’ve created an educational tool using a Raspberry Pi, we’d love to see it. The Raspberry Pi itself is an educational tool, so you’re helping it to fulfil its destiny! Make sure you share your projects with us on social media, or pop a link in the comments below. We’d also love to see people using the Pronunciation Training Machine (or similar projects), so make sure you share those too!

A massive shout out to Artie at hackster.io for this heads-up, and for all the other Raspberry Pi projects he sends my way. What a star!

A kind anonymous patron offers this prompt, which I totally fucked up getting done in July:

Something to do with programming languages? Alternatively, interesting game mechanics!

It’s been a while since I’ve written a thing about programming languages, eh? But I feel like I’ve run low on interesting things to say about them. And I just did that level design article, which already touched on some interesting game mechanics… oh dear.

Okay, how about this. It’s something I’ve been neck-deep in for quite some time, and most of the knowledge is squirrelled away in obscure wikis and ancient forum threads: getting data out of Pokémon games. I think that preserves the spirit of your two options, since it’s sort of nestled in a dark corner between how programming languages work and how game mechanics are implemented.

In the grand scheme of things, I don’t know all that much about this. I know more than people who’ve never looked into it at all, which I suppose is most people — but there are also people who basically do this stuff full-time, and that experience is crucial since so much of this work comes down to noticing patterns. While it sure helped to have a technical background, I wouldn’t have gotten anywhere at all if I weren’t acquainted with a few people who actually know what they’re doing. Most of what I’ve done is take their work and run with it.

Also, I am not a lawyer and cannot comment on any legal questions here. Is it okay to download ROMs of games you own? Is it okay to dump ROMs yourself if you have the hardware? Does this count as reverse engineering, and do the DMCA protections apply? I have no idea. But that said, it’s not exactly hard to find ROM hacking communities, and there’s no way Nintendo isn’t aware of them (or of the fact that every single Pokémon fansite gets their info from ROMs), so I suspect Nintendo simply doesn’t care unless something risks going mainstream — and thus putting a tangible dent in the market for their own franchise.

Still, I don’t want to direct an angry legal laser at anyone, so I’m going to be a bit selective about what resources I link to and what I merely allude to the existence of.

This is, necessarily, a pretty technical topic. It starts out in binary data and spirals down into microscopic details that even most programmers don’t need to care about. Sometimes people approach me to ask how they can help with this work, and all I can do is imagine the entire contents of this post and shrug helplessly.

Still, as usual, I’ll do my best to make this accessible without also making it a 500-page introduction to all of computing. Here is some helpful background stuff that would be clumsy to cram into the rest of the post.

Computers deal in bytes. Pop culture likes to depict computers as working in binary (individual bits or “binary digits”), which is technically true down on the level of the circuitry, but virtually none of the actual logic in a computer cares about individual bits. In fact, computers can’t access individual bits directly; they can only fetch bytes, then extract bits from those bytes as a separate step.

A byte is made of eight bits, which gives it 2⁸ or 256 possible values. It’s often helpful to look at the bytes themselves, but writing them in decimal is a bit clumsy, and writing them in binary is impossible to read. A clever compromise is to write them in hexadecimal, base sixteen, where the digits run from 0 to 9 and then A to F. Because sixteen is 2⁴, one hex digit is exactly four binary digits, and so a byte can conveniently be written as exactly two hex digits.

(A running theme across all of this is that many of the choices are arbitrary; in different times or places, other choices may have been made. Most likely, other choices were made, and they’re still in use somewhere. Virtually the only reliable constant is that any computer you will ever encounter will have bytes made out of eight bits. But even that wasn’t always the case.)

The Unix program xxd will print out bytes in a somewhat readable way. Here’s its output for a short chunk of English text.
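The original dump isn’t reproduced here, but you can make your own; piping the sixteen-byte string "Sixteen letters." through xxd prints a single line like this:

    $ echo -n "Sixteen letters." | xxd
    00000000: 5369 7874 6565 6e20 6c65 7474 6572 732e  Sixteen letters.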

Each line shows sixteen bytes. The left column shows the position (or “offset”) in the data, in hex. (It starts at zero, because programmers like to start counting at zero; it makes various things easier.) The middle column shows the bytes themselves, written as two hex digits each, with just enough space that you can tell where the boundaries between bytes are. The right column shows the bytes interpreted as ASCII text, with anything that isn’t a printable character replaced by a ‘.’.

(ASCII is a character encoding, a way to represent text as bytes — which are only numbers — by listing a set of characters in some order and then assigning numbers to them. Text crops up in a lot of formats, and this makes it easy to spot at a glance. Alas, ASCII is only one of many schemes, and it only really works for English text, but it’s the most common character encoding by far and has some overlap with the runners-up as well.)

Since everything is made out of bytes, there are an awful lot of schemes for how to express various kinds of information as bytes. As a result, a byte is meaningless on its own; it only has meaning when something else interprets it. It might be a plain number ranging from 0 to 255; it might be a plain number ranging from −128 to 127; it might be part of a bigger number that spans multiple bytes; it might be several small numbers crammed into one byte; it might be part of a color value; it might be a letter.
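Python’s struct module makes this easy to see: the very same bytes unpack to different values depending on what you tell it they are. A quick illustration:

    import struct

    struct.unpack('<B', b'\x9c')[0]        # as an unsigned byte: 156
    struct.unpack('<b', b'\x9c')[0]        # as a signed byte: -100
    struct.unpack('<H', b'\x9c\x03')[0]    # as the low byte of a 16-bit number: 924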

A meaningful arrangement for a whole sequence of bytes is loosely referred to as a format. If it’s intended for an entire file, it’s a file format. A file containing only bytes that are intended as text is called a plain text file (or format); this is in contrast to a binary file, which is basically anything else.

Some file formats are very common and well-understood, like PNG or MP3. Some are very common but were invented behind closed doors, like Photoshop’s PSD, so they’ve had to be reverse engineered for other software to be able to read and write them. And a great many file formats are obscure and ad hoc, invented only for use by one piece of software. Programmers invent file formats all the time.

Reverse engineering a format is largely a matter of identifying common patterns and finding data that’s expected to be present somewhere. Of course, in cases like Photoshop’s PSD, the most productive approach is to make small changes to a file in Photoshop and then see what changed in the resulting PSD. That’s not always an option — say, if you’re working with a game for a handheld that won’t let you easily run modified games.

Okay, hopefully that’s enough of that and you can pick up the rest along the way!

Before Diamond and Pearl, all of veekun’s data was just copied from other sources. Like, when I was in high school, I would spend lunch in the computer lab meticulously copy/pasting the Gold and Silver Pokédex text from another website into mine. Hey, I started the thing when I was 12.

But then… something happened. I can’t remember what it was, which makes this a much less compelling story. I assume veekun got popular enough that a couple other Pokénerds found out about it and started hanging around. Then when Diamond and Pearl came out, they started digging into the games, and I thought that was super interesting, so I did it too.

This is what led veekun into being much more about ripped data, though its track record has been… bumpy.

Everything in a computer is, on some level, a sequence of bytes. Game consoles and handhelds, being computers, also deal in bytes. A game cartridge is just a custom disk, and a ROM is a file containing all the bytes on that disk. (It’s a specific case of a disk image, like an ISO is for CDs and DVDs. You can take a disk image of a hard drive or a floppy disk or anything else, too; they’re all just bytes.)

But what are those bytes? That’s the fundamental and pervasive question. In the case of a Nintendo DS cartridge, the first thing I learned was that they’re arranged in a filesystem. Most disks have a filesystem — it’s like a table of contents for the disk, explaining how the one single block of bytes is divided into named files.

That is fantastically useful, and I didn’t even have to figure out how it works, because other people already had. Let’s have a look at it, because seeing binary formats is the best way to get an idea of how they might be designed. Here’s the beginning of the English version of Pokémon Diamond.

How do we make sense of this? Let us consult the little tool I started writing for this, porigon-z. It’s abandoned and unfinished and not terribly well-written; I would just link to the documentation I consulted when writing this, but it’s conspicuously 404ing now, so this’ll have to do. I described the format using an old version of the Construct binary format parsing library, and it looks like this:
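That snippet isn’t reproduced here, so here’s a cut-down stand-in in the same construct 2.5-era style. The field names are my own, and most of the header is skipped; the point is just the shape of the thing.

    from construct import Struct, String, ULInt16, ULInt32, Padding

    nds_header = Struct('nds_header',
        String('title', 12),
        String('game_id', 4),
        ULInt16('maker_code'),
        Padding(0x2e),                       # internal stuff we don't care about
        # the block of offsets starting at 0x40:
        ULInt32('filename_table_offset'),
        ULInt32('filename_table_length'),
        ULInt32('fat_offset'),
        ULInt32('fat_length'),
    )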

A String is text of a fixed length, either truncated or padded with NULs (character zero) to fit. The clumsy ULInt16 means an Unsigned, Little-endian, 16-bit (two byte) integer.

(What does little-endian mean? I’m glad you asked! When a number spans multiple bytes, there’s a choice to be made: what order do those bytes go in? The way we write numbers is big-endian, where the biggest part appears first; but most computers are little-endian, putting the smallest part first. That means a number like 0x1234 is actually stored in two bytes as 34 12.)

Alas, this is a terrible example, since most of this is goofy internal stuff we don’t actually care about. The interesting bit is the “file table”. A little ways down my description of the format is this block of ULInt32s, which start at position 0x40 in the file.

I included one previous line for context; starting right after a whole bunch of ffs or 00s is a pretty good sign, since those are likely to be junk used to fill space. So we’re probably in the right place, or at least a right place. Also we’re definitely in the right place since I already know porigon-z works, but, you know.

The beginning part of this is a bunch of numbers that start out relatively low and gradually get bigger. That’s a pretty good indication of an offset table — a list of “where this thing starts” and “how long it is”, just like the offset/length pairs that pointed us here in the first place. The only difference here is that we have a whole bunch of them. And porigon-z confirms that this is a list of:

My code does a bit more than this, but I don’t want this post to be about the intricacies of an old version of Construct. The short version is that each entry is eight bytes long and corresponds to a directory; this list actually describes the directory tree. Decoding the first few produces:

Again, we encounter some mild weirdness. The parent ids seem to count upwards, except for the first one, and where did that f come from? It turns out that for the first record only — which is the root directory and therefore has no parent — the parent id is actually the total number of records to read. So there are 0x0045 or 69 records here. As for the f, well, I have no idea! I just discard it entirely when linking directories together.
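In plain Python (using struct instead of Construct), reading the directory table looks something like this. The field order (offset, then top file id, then parent) is an assumption that matches the walkthrough below.

    import struct

    def read_directory_table(rom, table_offset):
        # Each record is eight bytes: a ULInt32 offset to the directory's
        # filename list, a ULInt16 "top file id", and a ULInt16 parent id.
        # The root record has no parent, so its last field is the record count.
        count = struct.unpack_from('<H', rom, table_offset + 6)[0]
        dirs = []
        for i in range(count):
            offset, top_file_id, parent = struct.unpack_from('<IHH', rom, table_offset + i * 8)
            dirs.append({
                'offset': offset,
                'top_file_id': top_file_id,
                'parent': parent & 0x0fff,   # discard that mysterious f
            })
        return dirs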

So let’s fully decode entry 3 (the fourth one, since we started at zero). It has offset 0x000002fd, which is relative to where the table starts, so we need to add that to 0x00336400 to get 0x003366fd. We don’t have a length, but starting from there we see:

I called the structure here a filename_list_struct. Also, as I read this code, I really wish I’d made it more sensible; sorry, I guess I’ll clean it up when I get around to re-ripping gen 4. The Construct code is a bit goofy, but the idea is:

Read a byte. If it’s zero, stop here. Otherwise, the top bit is a flag indicating whether this entry is a directory; the rest is a length.

The next length bytes are the filename.

Iff this is a directory, the next two bytes are the directory id.

Repeat.

(Ah yes, bits and flags. A flag is something that can only be true or false, so it really only needs one bit to store. So programmers like to cram flags into the same byte as other stuff to save space. Computers can’t examine individual bits directly, but it’s easy to manipulate them from code with a little math. Of course, using 1 bit for a flag means only 7 are left for the length, so it’s limited to 127 instead of 255.)

Let’s try this. The first byte is 0c. I can tell you right away that the top bit is zero; if the top bit is one, then the first hex digit will be 8 or greater. So this is just a file, and it’s 0c or 12 bytes long. The next twelve bytes are cb_data.narc, so that’s the filename. Repeat from the beginning: the next byte is 00, which is zero, so we’re done. This directory only contains a single file, cb_data.narc.

But wait, what is this directory? We know its id is 3; its name would appear somewhere in the filename list for its parent directory, 2, along with an extra two bytes indicating it matches to directory 3. To get the name for directory 2, we’d consult directory 1; and directory 1’s parent is directory 0. Directory 0 is the root, which is just / and has no name, so at that point we’re done. Of course, if we read all these filename lists in order rather than skipping straight to the third one, then we’d have already seen all these names and wouldn’t have to look them up.

One final question: where’s the data? All we have are filenames. It turns out the data is in a totally separate table at fat_offset — “FAT” is short for “file allocation table”. That’s a vastly simpler list of pairs of start offset and end offset, giving the positions of the individual files, and nothing else.

All we have to do is match up the filenames to those offset pairs. This is where the “top file id” comes in: it’s the id of the first file in the directory, and the others count up from there. This directory’s top file id is 0x57, so cb_data.narc has file id 0x57. (If there were a next file, it would have id 0x58, and so on.) Its data is given by the 0x57th (87th) pair of offsets.
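Pulling a file out is then a couple of lines, assuming each FAT record is a pair of ULInt32s (start and end):

    import struct

    def read_file(rom, fat_offset, file_id):
        # Each FAT record is two little-endian 32-bit offsets: start and end.
        start, end = struct.unpack_from('<II', rom, fat_offset + file_id * 8)
        return rom[start:end]
    # read_file(rom, fat_offset, 0x57) would hand back cb_data.narc's bytes.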

Phew! We haven’t even gotten anywhere yet. But this is important for figuring out where anything even is. And you don’t have to do it by hand, since I wrote a program to do it. Run:

Hey, it’s our friend cb_data.narc, with its full path! On the left is its file id, 87. Next are its start and end offsets, followed by its filesize.

You may notice that before the filenames start, you’ll get a list of unnamed files. These are entries in the FAT that have no corresponding filename. I learned only recently that they’re code — overlays, in fact, though I don’t know what that means yet.

This was fantastic. All the game data, neatly arranged into files, and even named sensibly for us. A goldmine. It didn’t used to be so easy, as we will see later.

Other people had already noticed the file /poketool/personal/personal.narc contains much of the base data about Pokémon. You’ll notice it has a “501” in brackets next to it, indicating that it’s actually a NARC file — a “Nitro archive”, though I’m not sure what “Nitro” refers to. This is a generic uncompressed container that just holds some number of sub-files — in this case, 501. The subfiles can have names, but the ones in this game generally don’t, so the only way to refer to them is by number.

You may also notice that evo.narc and wotbl.narc, in the same directory, are also NARCs with 501 records. It’s a pretty safe bet that they all have one record per Pokémon. That’s a little odd, since Pokémon Diamond only has 493 Pokémon, but we’ll figure that out later.

NARC is, as far as I can tell, an invention of Nintendo. I think it’s in other DS games, though I haven’t investigated any others very much, so I can’t say how common it is. It’s a very simple format, and it uses basically the same structure as the entire DS filesystem: a list of start/end offsets and a list of filenames. It doesn’t have the same directory nesting, so it’s much simpler, and also the filenames are usually missing, so it’s simpler still. But you don’t have to care, because you can examine the contents of a file with:

This will print every record as an unbroken string of hex, one record per line. (I admit this is not the smartest format; it’s hard to see where byte boundaries are. Again, hopefully I’ll fix this up a bit when I rerip gen 4.) Here are the first six Pokémon records.

That first one is pretty conspicuous, what with its being all zeroes. It’s probably a dummy entry, for whatever reason. That does make things a little simpler, though! Numbering from zero has caused some confusion in the past: Bulbasaur (National Dex number 1) would be record 0, and I’ve had all kinds of goofy bugs from forgetting to subtract or add 1 all over the place. With a dummy record at 0, that means Bulbasaur is 1, and everything is numbered as expected.

So, what is any of this? The heart of figuring out data formats is looking for stuff you know. That might mean looking for data you know should be there, or it might mean identifying common schemes for storing data.

A good start, then, would be to look at what I already know about Bulbasaur. Base stats are a pretty fundamental property, and Bulbasaur’s are 45, 49, 49, 65, 65, and 45. In hex, that’s 2d 31 31 41 41 2d. Hey, that’s the beginning of the first line! It’s just slightly out of order; Speed comes before the special stats.
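(That conversion is a one-liner if you don’t feel like doing hex in your head:)

    ' '.join('%02x' % n for n in (45, 49, 49, 65, 65, 45))   # -> '2d 31 31 41 41 2d'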

You can also pick out some differences by comparing rows. About 60% of the way along the line, I see 03 for Bulbasaur, Ivysaur, and Venusaur, but then 00 for Charmander and Charmeleon. That’s different between families, which seems like a huge hint; does that continue to hold true? (As it turns out, no! It fails for Butterfree — because it indicates the Pokémon’s color, used in the Pokédex search. Most families are similar colors.)

Sometimes a byte will seem to only take one of a few small values, which usually means it’s an enum (one of a list of numbered options), like the colors are. A byte that only ranges from 1 to 17 (or perhaps 0 to 16) is almost certainly type, for example, since there are 17 types.

Noticing common patterns — very tiny formats, I suppose — is also very helpful (and saves you from wild goose chases). For example, Pokémon can appear in the wild holding items, and there are more than 256 items, so referring to an item requires two bytes. But there are only slightly more than 256 items in this game, so the second byte is always 00 or 01. If you remember that some fields must span multiple bytes, that’s an incredible hint that you’re looking at small 16-bit numbers; if you forget, you might think the 01 is a separate field that only stores a single flag… and drive yourself mad trying to find a pattern to it.

The games have a number of TMs which can teach particular moves, and each Pokémon can learn a unique set of TMs. These are stored as a longer block of bytes, where each individual bit is either 1 or 0 to indicate compatibility. Those are a bit harder to identify with certainty, since (a) the set of TMs changes in every game so you can’t just check what the expected value is, and (b) bitflags can produce virtually any number with no discernible pattern.

Thankfully, there’s a pretty big giveaway for TMs in particular. Here are Caterpie, Metapod, and Butterfree:

Butterfree can learn TMs. Caterpie and Metapod are almost unique in that they can’t learn any. Guess where the TMs are! Even better, Caterpie is only #9, so this shows up very early on.
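Once you’ve found the block, decoding it is simple. A sketch, with the caveat that treating the lowest bit of each byte as the first TM is my assumption about the bit order:

    def learnable_tms(machine_bytes):
        tms = []
        for i in range(len(machine_bytes) * 8):
            if machine_bytes[i // 8] >> (i % 8) & 1:
                tms.append(i + 1)        # bit 0 of byte 0 is TM01, and so on
        return tms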

And, well, that’s the basic process. It’s mostly about cheating, about leveraging every possible trick you can come up with to find patterns and landmarks. I even wrote a script for this (and several other files) that dumped out a huge HTML table with the names of the (known) Pokémon on the left and byte positions as columns. When I figured something out, or at least had a suspicion, I labelled the column and changed that byte to print as something more readable (e.g., printing the names of types instead of just their numbers).

Of course, if you have a flash cartridge or emulator (both of which were hard to come by at the time), you can always invoke the nuclear option: change the data and see what changes in the game.

What we really really wanted were the sprites. This was a new generation with new Pokémon, after all, and sprites were the primary way we got to see them. Unlike nearly everything else, this hadn’t already been figured out by other people by the time I showed up.

Finding them was easy enough — there’s a file named /poketool/pokegra/pokegra.narc, which is conspicuously large. It’s a NARC containing 2964 records. A little factoring reveals that 2964 is 494 × 6 — aha! There are 493 Pokémon, plus one dummy.

This will extract the contents of pokegra.narc to a directory called pokemon-diamond.nds:data, which I guess might be invalid on Windows or something, so use -d to give another directory name if you need to. Anyway, in there you’ll find a directory called pokegra.narc, inside of which are 2964 numbered binary files.

Some brief inspection reveals that they definitely come in groups of six: the filesizes consistently repeat 6.5K, 6.5K, 6.5K, 6.5K, 72, 72. Sometimes a couple of the files are empty, but the pattern is otherwise very distinct. Four sprites per Pokémon, then?

Let’s have a look at the first file! Since it’s a dummy sprite, it should be blank or perhaps a question mark, right? Oh boy I’m so excited.

Hm. Okay, so, this is a problem. No matter what the actual contents are, this is a sprite, and virtually all Pokémon sprites have a big ol’ blob of completely empty space in the upper-left corner. Every corner, in fact. Except for a handful of truly massive species, the corners should be empty. So no matter what scheme this is using or what order the pixels are in, I should be seeing a whole lot of zeroes somewhere. And I’m not.

Compression? Seems very unlikely, since every file is either 0, 72, or 6448 bytes, without exception.

Well, let’s see what we’ve got here. RGCN and RAHC are almost certainly magic numbers, so this is one file format nested inside another. (A lot of file formats start with a short fixed string identifying them, a so-called “magic number”. Every GIF starts with the text GIF89a, for example. A NARC file starts with CRAN — presumably it’s “backwards” because it’s being read as an actual little-endian number.) I assume the real data begins at 0x30.

Without that leading 0x30 (48) bytes, the file is 6400 bytes large, which is a mighty conspicuous square number! Pokémon sprites have always been square, so this could mean they’re 80×80, one byte per pixel. (Hm, but Pokémon sprites don’t need anywhere near 256 colors?)

I see a 30 in the first line, which is probably the address of the data. I also see a 10, which is probably the (16-bit?) length of that initial header, or the address of the second header. What about in the second header? Well, uh, hm. I see a lot of what seem to be small 16-bit or 32-bit numbers: 0x000a is 10, 0x0014 is 20, 0x0003 is 3; 0x0018 is 24. A quick check reveals that 0x1900 is 6400 (the size of the data), and so 0x1920 is the size of the data plus this second header.

This hasn’t really told me anything I don’t already know. It seems very conspicuous that there’s no 0x50, which is 80, my assumed size of the sprite.

Well, hm, let’s look at the second file. It’s in the block for the same “Pokémon”, so maybe it’ll provide some clues.

Ah. No. It starts out completely identical. In fact, md5sum reveals that all four of these first sprites are identical. Might make sense for a dummy Pokémon. Does that pattern hold for the next Pokémon, which I assume is Bulbasaur? Not quite! Files 6 and 7 are identical, and 8 and 9 are identical, but they’re distinct from each other.

What’s the point of them then? Further inspection reveals that most Pokémon have paired sprites like this, but Pikachu does not — suggesting (correctly) that the sprites are male versus female, so Pokémon that don’t have gender differences naturally have identical pairs of sprites.

Okay, then, let’s look at Pikachu’s first sprite, 150. The key is often in the differences, remember. If the dummy sprite is either blank or a question mark, then it should still have a lot of corner pixels in common with the relatively small Pikachu.

Make a histogram? Every possible value appears with roughly the same frequency. Now that is interesting, and suggests some form of encryption — most likely one big “mask” xor’d with the whole sprite. But how to find the mask?

(It doesn’t matter exactly what xor is, here. It only has two relevant properties. One is that it’s self-reversing, making it handy for encryption like this — (data xor mask) xor mask produces the original data. The other is that anything xor’d with zero is left unchanged, so if I think the original data was zero — as it ought to be for the blank pixels in the corners of a sprite — then the encrypted data is just the mask! So I know at least the beginning parts of the mask for most sprites; I just have to figure out how to use a little bit of that to reconstitute the whole thing.)

I stared at this for days. I printed out copies of sprite hex and stared at them more on breaks at work. I threw everything I knew, which admittedly wasn’t a lot, at this ridiculous problem.

And slowly, some patterns started to emerge. Consider the first digit of the last column in the above hex dump: it goes e, d, d, c, c, b, b, a. In fact, if you look at the entire byte, they go e0, d8, d0, c8, etc. That’s just subtracting 08 on each row.

Are there other cases like this? Kinda! In the third column, the second digit alternates between 7 and f; closer inspection reveals that byte’s increasing by 18 every row. Oh, the sixth column too. Hang on — in every column, the second digit alternates between two values. That seems true for every other file we’ve seen so far, too.

This is extremely promising! Let’s try this. Take the first two rows, which are bytes 0–15 and bytes 16–31. Subtract the second row from the first row bytewise, making a sort of “delta row”. For the second Pikachu, that produces:


d82b 3830 1809 789f 58ae b82c 9828 f827

As expected, the second digit in each column is an 8. Now just start with the first row and keep adding the delta to it to produce enough rows to cover the whole file, and xor that with the file itself. Results:

Promising! We got a bunch of zeroes, as expected, though everything else is still fairly garbled. It might help if we, say, printed this out to a file.
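In code, the kinda-working trick amounts to something like this. It’s my own sketch, and it quietly assumes the first two sixteen-byte rows of the real image are blank, so they equal the mask:

    def kinda_decrypt(sprite, row_len=16):
        row = list(sprite[:row_len])
        delta = [(a - b) % 256 for a, b in zip(sprite[:row_len], sprite[row_len:2 * row_len])]
        mask = bytearray()
        while len(mask) < len(sprite):
            mask.extend(row)
            row = [(x - d) % 256 for x, d in zip(row, delta)]
        return bytes(x ^ m for x, m in zip(sprite, mask[:len(sprite)]))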

By now it had become clear that the small files were palettes of sixteen colors stored as RGB555 — that is, each color is packed into two bytes, with five bits each for the red, green, and blue channels. Sixteen colors means two pixels can be crammed into a single byte, so the sprites are actually 160×80, not 80×80. Combining this knowledge with the above partially-decrypted output, we get:

Success!

Kinda!

Meanwhile another fansite found our code and put up a full set of these ugly-ass corrupt sprites, so that was nice.

It took me a while to notice another pattern, which emerges if you break the sprite into blocks that are 512 bytes wide (rather than only 16). You get this:

This time, the byte in the first column is always identical all the way down. Well, kind of. This is encrypted data, remember, and I only know what the mask is because the beginning of the data is usually blank. The exceptions are when the mask is hitting actual colored pixels, at which point it becomes garbage.

But even better, look at the second byte in each column. Now they’re all separated by a constant, all the way down! That means I can repeat the same logic as before, except with two “rows” that are 512 bytes long, and as long as the first 1024 bytes of the original data are all zeroes, I’ll get a perfect sprite out!

And indeed, I did! Mostly. Legendary Pokémon and a handful of others tend to be quite large, so they didn’t start with as many zeroes as I needed for this scheme to work. But it mostly worked, and that was pretty goddamn cool.

magical, a long-time co-conspirator, managed to scrounge up my final “working” code from that era (which then helped me find my own local copy of all my D/P research stuff, which I oughta put up somewhere). It’s total nonsense, but it came pretty close to working.

…

Hm? What? You want to know the real answer? Yeah, I bet you do.

Okay, here you go. So the games have a random number generator, for… generating random numbers. This being fairly simple hardware with fairly simple (non-crypto) requirements, the random number generator is also fairly simple. It’s an LCG, a linear congruential generator, which is a really bizarre name for a very simple idea:


ax + b

The generator is defined by the numbers a and b. (You have to pick them carefully, or you’ll get numbers that don’t look particularly random.) You pick a starting number (a seed) and call that x. When you want a random number, you compute ax + b. You then take a modulus, which really means “chop off the higher digits because you only have so much space to store the answer”. That’s your new x, which you’ll plug in to get the next random number, and so on.

In the case of the gen 4 Pokémon games, a = 0x4e6d and b = 0x6073.

What does any of this have to do with the encryption? Well! The entire sprite is interpreted as a bunch of 16-bit integers. The last one is used as the seed and plugged into the RNG, and then it keeps spitting out a sequence of numbers. Reverse them, since you’re starting at the end, and that’s the mask.
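Put together, the real decryption is only a few lines. This is my reading of the scheme as described above; in particular, using the seed itself as the last mask word and keeping only 16 bits at each step are assumptions on my part:

    def decrypt_sprite(words):
        # `words` is the sprite as a list of 16-bit integers.
        a, b = 0x4E6D, 0x6073
        seed = words[-1]                      # the last word seeds the generator
        mask = []
        for _ in words:
            mask.append(seed)
            seed = (a * seed + b) & 0xFFFF    # next "random" number
        mask.reverse()                        # the sequence runs backwards from the end
        return [w ^ m for w, m in zip(words, mask)]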

The seed technically overlaps with the last four pixels, but it happens to work since no Pokémon sprites touched the bottom-right corner in Diamond and Pearl. In Platinum a couple very large sprites broke that rule, so they ended up switching this around and starting from the beginning. Same idea, though.

Of course, porigon-z knows how to handle this… though it’s currently hardcoded to use the Platinum approach. Funny story: the algorithm was originally thought to go from the beginning, not the end, and it used an LCG with different constants. Turns out someone had just accidentally (?) discovered the reverse of the Pokémon LCG, which would produce exactly the same sequence, backwards. Super cool.

Why did that thing with subtracting the two rows kinda-sorta work, then? Well! It’s because… when you… and… uh… wow, I still have no goddamn idea. That makes no sense at all. I’d sure love for someone to explain that to me. I’m sure I could explain it if I sat down and thought about it for a while, but I suspect it’s something subtle and I’m not that interested.

I’d also like to know: why were the sprites encrypted in the first place? What possible point is there? They must have known we cracked the encryption, but then they used it again for Platinum, and Heart Gold and Soul Silver. Maybe it was only intended to be enough to delay us, during the gap between the Japanese and worldwide releases…? Hm.

Incidentally, the entire game text is also encrypted in much the same way. Without the encryption, it’s just UTF-16 — a common character encoding that uses two bytes for every character. I have no idea why.

So Nintendo DS cartridges have a little filesystem on them, making them act kinda like any other disk. Nice.

Game Boy cartridges… don’t. A Game Boy cartridge is essentially just a single file, a program. You pop the cartridge in, and the Game Boy runs that program.

Where is the data, then? Baked into the program — referred to as hard-coded. Just, somewhere, mixed in alongside all the program code. There’s no nice index of where it is; rather, somewhere there’s some code that happens to say “look at the byte at 0x006f9d10 in this program and treat it as data”.

I wasn’t involved in data dumping in these days; I was copying stuff straight out of the wildly inaccurate Prima strategy guide. (Again, you know, I was 12.) It’s hard to say exactly how people fished out the data, though I can take a few guesses.

To our advantage is the fact that Game Boy cartridges are much smaller than DS cartridges, so there’s much less to sift through. Pokémon Red and Blue are on 1 MB cartridges, and even those are half empty (unused NULs); the original Japanese Red and Green barely fit into 512 KB, and Red and Blue ended up just slightly bigger.

To our disadvantage is that these are the very first games, so we don’t have any pre-existing knowledge to look for. We don’t know any Pokémon’s base stats; we may not even know that “base stats” are a thing yet. Also, it’s not immediately obvious, but the Pokémon aren’t even stored in order. Oh, and Mew is completely separate; it really was a last-minute addition.

What do we know? Well, by playing the game, we can see what moves a Pokémon learns and when. There don’t seem to be all that many moves, so it’s reasonable to assume that a move would be represented with a single byte. Levels are capped at 100, so that’s probably also a single byte. Most likely, the level-up moves are stored as either level move level move... or move level move level....

Great! All we need to do is put together a string of known moves both ways and find them.

Except, ah, hm. We don’t actually know how the moves are numbered. But we still know the levels, so maybe we can get somewhere. Let’s take Bulbasaur, which we know learns Leech Seed at level 7, Vine Whip at level 13, and Poison Powder at level 20. (Or, I guess, that should be LEECHSEED, VINEWHIP, and POISONPOWDER.) No matter whether the levels or moves come first, this will result in a string like:


07 ?? 0D ?? 14

So we can do my favorite thing and slap together a regex for that. (A regex is a very compact way to search text — or bytes — for a particular pattern. A lone . means any single character, so the regex below is a straightforward translation of the pattern above.)
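In Python, searching the raw ROM for that pattern is only a couple of lines (the filename is hypothetical; re.DOTALL makes the dot match any byte, including 0x0a):

    import re

    rom = open('pokered.gb', 'rb').read()
    # level 7, some move, level 13, some move, level 20
    for m in re.finditer(rb'\x07.\x0d.\x14', rom, re.DOTALL):
        print(hex(m.start()), m.group().hex())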

This seems pretty promising! It looks like the same set of moves is repeated 16 bytes later, but with different (slightly higher) levels after a certain point, which matches how evolved Pokémon behave. So this looks to be at least Bulbasaur and Ivysaur, though I’m not quite sure what happened to Venusaur.

By repeating this process with some other Pokémon, we can start to fill in a mapping of moves to their ids. Eventually we’ll realize that a Pokémon’s starting moves don’t seem to appear within this structure, and so we’ll go searching for those for a Pokémon that starts with moves we know the ids for. That will lead us to the basic Pokémon data along with base stats, because starting moves happen to be stored there in these early games.

The text isn’t encrypted, but also isn’t ASCII, but it’s possible to find it in much the same way by treating it as a cryptogram (or a substitution cipher). I assume that there’s some consistent scheme, such that the letter “A” is always represented with the same byte. So I pick some text that I know has a few repeated letters, like BULBASAUR, and I recognize that it could be substituted in some way to read as 123145426. I can turn that into a regex!

Unfortunately, this produces a zillion matches, most of them solid strings of NUL bytes. The problem is that nothing in the regex requires that the different groups are, well, different. You could write extra code to filter those cases out, or if you’re masochistic enough, you could express it directly within the regex using (?!...) negative lookahead assertions:
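For the masochists, one way to write the “all groups must differ” version with negative lookaheads:

    import re

    strict_bulbasaur = re.compile(
        rb'(.)(?!\1)(.)(?!\1|\2)(.)\1(?!\1|\2|\3)(.)(?!\1|\2|\3|\4)(.)\4\2(?!\1|\2|\3|\4|\5)(.)',
        re.DOTALL)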

That’s much more reasonable. (The set of matches, I mean, not the regex.) It wouldn’t be hard to write a script with a bunch of known strings in it, generate appropriate regexes for each, eliminate inconsistent matches, and eventually generate a full alphabet. (Or you could assume that “B” immediately follows “A” and in general the letters are clustered together, which would lead you to correctly suspect that the strings at 0x0001c80e and 0x00094e75 are the ones you want.)

Even better, once you have an alphabet, you can use it to translate the entire ROM — plenty of it will be garbage, but you’ll find quite a lot of blocks of human-readable text! And now you have all the names of everything and also the entire game script.

But like I said, I wasn’t involved in any of that. Until recently! I’ve been working on an experiment for veekun where I re-dump all the games to a new YAML-based format. Long story short: the current database is a pain to work with, and some old data has been lost entirely. Also, most of the data was extracted bits at a time with short hacky scripts that we then threw away, and I’d like to have more permanent code that can dump everything at once. It’ll be nice to have an audit trail, too — multiple times in the past, we’ve discovered that some old thing was dumped imperfectly or copied from an unreliable source.

So I started re-dumping Red and Blue, from scratch. I’ve made modest progress, though it’s taken a backseat to Sun and Moon lately.

It’s helped immensely that there’s an open source disassembly of Red and Blue floating around. What on earth is a disassembly? I’m so glad you asked!

Game Boy games were written in assembly code, which is just about as simple and painful as you can get. It’s human-readable, kinda, but it’s built from the basic set of instructions that a particular CPU family understands. It can’t directly express concepts like “if this, do that, otherwise do something else” or “repeat this code five times”. Instead, it’s a single long sequence of basic operations like “compare two numbers” and “jump ahead four instructions”. (Very few programmers work with assembly nowadays, but for various reasons, no other programming languages would work on the Game Boy at the time.)

To give you a more concrete idea of what this is like to work with: the Game Boy’s CPU doesn’t have an instruction for multiplying, so you have to do it yourself by adding repeatedly. I thought that would make a good example, but it turns out that Pokémon’s multiply code is sixty lines long. Division is even longer! Here’s something a bit simpler, which fills a span of memory:
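The listing itself isn’t reproduced above, but it can be pieced back together from the walkthrough below and the machine code shown later. In RGBDS-style syntax (the label and comment are my own reconstruction), it looks roughly like this:

    FillMemory::
    ; Fill bc bytes at hl with a.
        push de
        ld d, a
    .loop
        ld a, d
        ld [hli], a
        dec bc
        ld a, b
        or c
        jr nz, .loop
        pop de
        ret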

CPUs tend to have a small number of registers, which can hold values while the CPU works on them — even as fast as RAM is, it’s considered much slower than registers. The downside is, well, you only have a few registers. The Game Boy CPU (a modified Z80) has eight registers that can each hold one byte: a through f, plus h and l. They can be used together in pairs to store 16-bit values, giving the four combinations af, bc, de, and hl.

(If you need more than 16 bits, well, that’s your problem! 16 bits is the most the CPU understands; it can’t even access memory addresses beyond that range, so you’re limited to 64K. “But wait”, you ask, “how can a Game Boy cartridge be 512K or 1M?” Very painfully, that’s how.)

Now we can understand the comment in the above code. Starting at the memory address given by the 16-bit number in hl, it will copy the value in a into each byte, for a total of bc bytes. Translated into English, the above means something like this:

1. Save copies of d and e, so I can mess with them without losing any important data that was in them. (This code doesn’t use e, but there’s no push d instruction.)
2. Copy a, the fill value, into d.
3. Copy d, the fill value, into a.
4. Copy a, the fill value, into the memory at address hl. Then increase hl (the actual registers, not the memory) by 1.
5. Decrease bc, the number of bytes to fill, by 1.
6. Copy b, part of the number of bytes to fill, into a.
7. OR a with c, the other part of the number of bytes to fill, and leave the result in a. The result will be zero only if bc itself is zero, in which case the “zero” flag will be set.
8. If the zero flag is not set (i.e., if bc isn’t zero, meaning we’re not done yet), jump back to the instruction marked by .loop, which is step 3.
9. Restore d and e to their previous values.
10. Return to whatever code jumped here in the first place.

Even this relatively simple example has to resort to a weird trick — ORing b and c together — just to check if bc, a value the CPU understands, is zero or not.

CPUs don’t execute assembly code directly. It has to be assembled into machine code, which is (surprise!) a sequence of bytes corresponding to CPU instructions. When the above code is assembled, it produces these bytes, which you can verify for yourself appear in Pokémon Red and Blue in exactly one place:

    d5 57 7a 22 0b 78 b1 20 f9 d1 c9

I stress that this is way beyond anything virtually any programmer actually needs to know. Even the few programmers working with assembly, as far as I know, don’t usually care about the actual bytes that are spat out. I’ve actually had trouble tracking down lists of opcodes before — almost no one is trying to read machine code. We are out in the weeds a bit here.

To finally answer your hypothetical question: disassembly is the process of converting this machine code back into assembly. Most of it can be done automatically, but it takes extra human effort to make the result sensible. Let’s consult the Game Boy CPU’s (rather terse) opcode reference and see if we can make sense of this, pretending we don’t know what the original code was.

Find d5 in the table — it’s in row Dx, column x5. That’s push de. The first number in the box is 1, meaning the instruction is only one byte long, so the next instruction is the very next byte. That’s 57, which is ld d, a. Keep on going. Eventually we hit 20, which is jr nz, r8 and two bytes long — the notes at the bottom explain that r8 is 8-bit signed data. That means the next byte is part of this instruction; it’s f9, but it’s signed, so really that’s -7. We end up with:
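(Roughly this; the exact spelling of the mnemonics depends on whose opcode table you’re reading.)

            push de
            ld d, a
            ld a, d
            ld [hli], a
            dec bc
            ld a, b
            or c
            jr nz, -7
            pop de
            ret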

This looks an awful lot like what we started with, but there are a couple of notable differences. First, the FillMemory:: line is completely missing. That’s just another kind of label, and the only way to know that the first line should be labelled at all is to find some other place in the code that tries to jump here. Given just these bytes, we can’t even tell if this is a complete snippet. Once we find that out, there’s still no way to recover the name FillMemory; even that is just a fan name and not the name from the original code. Someone came up with it by reading this assembly code, understanding what it’s intended to do, and naming it accordingly.

Second, the .loop label is missing. The jr line forgot about the label and ended up with a number, which is how many bytes to jump backwards or forwards. (You can imagine how a label is much easier to work with than a number of bytes, especially when some instructions are one byte long and some are two!) An automated disassembler would be smart enough to notice this and would put a label in the right place. A really good disassembler might even recognize that this code is a simple loop that executes some number of times, and name that label .loop; otherwise, or for more complicated kinds of jumps, it would have a meaningless name that a human would have to improve.

And there’s a whole project where people have done the work of restoring names like this and splitting code up sensibly! The whole thing even assembles into a byte-for-byte identical copy of the original games. It’s really quite impressive, and it’s made tinkering with the original games vastly more accessible. You still have to write assembly, of course, but it’s better than editing machine code. Imagine trying to add a new instruction in the middle of the loop above; you’d screw up the jr’s byte count, and every single later address in the ROM.

But more relevant to this post, a disassembly makes it easy to figure out where data is, since I don’t have to go hunting for it! When the code is assembled, it can generate a .sym file, which lists every single “global” label and the position it ended up in the final ROM. Many of those labels are for functions, like FillMemory is, but some of them are for blocks of data.
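A .sym file is just a text file with one label per line. A minimal sketch of slurping it in, assuming the usual “bank:address LabelName” line format with both numbers in hex (turning bank and address into a flat ROM offset is its own little exercise):

    def read_sym(path):
        # Map label names to (bank, address) pairs from a .sym file.
        labels = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith(';'):   # skip comments and blanks
                    continue
                location, name = line.split(maxsplit=1)
                bank, addr = (int(x, 16) for x in location.split(':'))
                labels[name] = (bank, addr)
        return labels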

I set out to write some code to dump data from Game Boy games. Red/Green, Red/Blue, and Yellow were all fairly similar, so I wanted to use as much of the same code as possible for all of those games (and their various translations).

A very early pain point was, well, the existence of all those translations. Because there’s no filesystem, the only obvious way to locate data is to either search for it (which requires knowing it ahead of time, a catch-22 for a script that’s meant to extract it) or to bake in the addresses. The games contain quite a lot of data I want, and they exist in quite a few distinct versions, so that would make for a lot of addresses.

Also, with a disassembly readily available, it was now (relatively) easy for people to modify the games as they saw fit, in much the same way as it’s easy to modify most aspects of modern games by changing the data files. But if I simply had a list of addresses for each known release, then my code wouldn’t know what to do with modified games. It’s not a huge deal — obviously I don’t intend to put fan modifications into veekun — but it seemed silly to write all this extraction code and then only have it work on a small handful of specific files.

I decided to at least try to find data automatically. How could I do that, when the positions of the data only exist buried within machine code somewhere?

Obviously, I just need to find that machine code. See, that whole previous section was actually relevant!

I set out to do that. Remember the goofy regex from earlier, which searched for particular patterns of bytes? I did something like that, except with machine code. And by machine code, I mean assembly. And by assembly, I mean— okay just look at this.

I wrote my own little assembler that can convert Game Boy assembly into Game Boy machine code. The difference is that when it sees something like #foo, it assumes that’s a value I don’t know yet and sticks in a regex capturing group instead. It’s smart enough to know whether the value has to be one or two bytes, based on the instruction. It also knows that if the same placeholder appears twice, it must have the same value both times. I can also pass in a placeholder value, if I only know it at runtime.
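A minimal sketch of the trick (not my actual code, and only a toy subset of it):

    import re

    def chunk_to_regex(chunk):
        # chunk is a list of (opcode byte, operand); an operand like '#table'
        # is a placeholder: it becomes a two-byte capture group the first time
        # it appears, and a backreference to that group every time after.
        parts = []
        groups = {}
        for opcode, operand in chunk:
            parts.append(re.escape(bytes([opcode])))
            if operand is None:
                continue
            if isinstance(operand, str) and operand.startswith('#'):
                if operand in groups:
                    parts.append(b'(?P=' + groups[operand] + b')')
                else:
                    name = b'ph%d' % len(groups)
                    groups[operand] = name
                    parts.append(b'(?P<' + name + b'>..)')   # any 16-bit value
            else:
                parts.append(re.escape(operand))              # known literal bytes
        return re.compile(b''.join(parts), re.DOTALL)

    # e.g. "ld hl, #table" (0x21 nn nn) followed by "ld de, #table" (0x11 nn nn):
    # whatever two address bytes fill the first slot must also fill the second.
    pattern = chunk_to_regex([(0x21, '#table'), (0x11, '#table')])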

I have half a dozen or so chunks like this. Every time I wanted to find something new, I went looking for code that referenced it and copied the smallest unique chunk I could (to reduce the risk that the code itself is different between games, or in a fan hack). I did run into a few goofy difficulties, such as code that changed completely in Yellow, but I ended up with something that seems to be pretty robust and knows as little as possible about the games.

I even auto-detect the language… by examining the name of the TOWNMAP, the first item that has a unique name in every official language.

This is probably ridiculous overkill, but it was a blast to get working. It also leaves the door open for some… shenanigans I’ve wanted to do for a while.

Recent games have been slightly more complicated, though the complexity is largely in someone else’s court. The 3DS uses encryption — real, serious encryption, not baby stuff you can work around by comparing rows of bytes.

When X and Y came out, the encryption still hadn’t been broken, so all of veekun’s initial data was acquired by… making a Google Docs spreadsheet and asking for volunteers to fill it in. It wasn’t great, but it was better than nothing.

This was late 2013, and I suppose it’s around when veekun’s gentle decline into stagnation began. When X and Y were finally ripped, I was… what was I doing? I guess I was busy at work? For whatever reason, I had barely any involvement in it. Then Omega Ruby and Alpha Sapphire came out, and now everyone was busy, and it took forever just to get stuff like movesets dumped.

Now I’m working on Sun and Moon again. It’s not especially hard — much of the basic structure has been preserved since Diamond and Pearl, and a lot of the Omega Ruby and Alpha Sapphire code I wrote works exactly the same with Sun and Moon — but there are a lot of edge cases.

The most obvious wrinkle is that the filenames are gone. This has actually been the case since Heart Gold and Soul Silver — all the files now simply have names like /a/0/0/0 and count upwards from there. I don’t know the reason for the change, but I assume the original filenames weren’t intended to be left in the game in the first place. The files move around in every new pair of games, too, requiring bunches of people to go through the files by hand and note down what each of them appears to be.

Newer games use GARC instead of NARC. I don’t know what the G stands for now. (Gooder? Gen 6?) It’s basically the same idea, except that now a single GARC archive has two levels of nesting — that is, a GARC contains some number of sub-archives, and each of those in turn contains some number of subfiles. Usually there’s only one subfile per subarchive, but I’ve seen zanier schemes once or twice.

Also, just in case that’s not enough levels of containers for you, there are also a number of other single-level containers embedded inside GARCs. They’re all very basic and nearly identical: just a list of start and end offsets.
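A sketch of reading one of those trivial containers; the exact header layout here is a guess for illustration (a count followed by a list of offsets), not the real spec:

    import struct

    def read_simple_container(data):
        # Assumed layout: u16 subfile count, then (count + 1) u32 offsets
        # measured from the start of the container; subfile i spans
        # offsets[i] .. offsets[i + 1].
        count, = struct.unpack_from('<H', data, 0)
        offsets = struct.unpack_from('<%dI' % (count + 1), data, 2)
        return [data[offsets[i]:offsets[i + 1]] for i in range(count)]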

Oh, and some of the data is compressed now. (Maybe that was the case before X/Y? I don’t remember.) Compression is fun. Any given data might be compressed with one of two flavors of LZSS, and it seems completely arbitrary what’s compressed and what’s not. There’s no real header to tell you, either; the only marker compressed data has is that its first byte is either 0x10 or 0x11, which isn’t particularly helpful since plenty of valid uncompressed data also begins with one of those bytes.
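In practice the best you can do up front is a heuristic sniff, something like this sketch, which leans on the type byte being followed by a 24-bit little-endian decompressed size; even then, the only sure test is to try decompressing and see whether it falls apart:

    def looks_lz_compressed(data):
        # 0x10 and 0x11 are the two LZSS flavors; the next three bytes are the
        # decompressed size.  A plausible size is only a hint, since ordinary
        # uncompressed data can start with these bytes too.
        if len(data) < 4 or data[0] not in (0x10, 0x11):
            return False
        size = int.from_bytes(data[1:4], 'little')
        return 0 < size <= len(data) * 32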

But there was one much bigger practical problem with X and Y, one I’d been dreading for a while. X and Y, you see, use models — which means they don’t have any sprites for us to rip at all. And that kind of sucks.

The community’s solution has been for a few people (who have screen-capture hardware) to take screenshots and cut them out. It works, but it’s not great. The thing I’ve wanted for a very long time is rips of the actual models.

(Later games went back to including “sprites” that are just static screenshots of the models. Maybe out of kindness to us? Okay, yeah, doubtful. Oh, and those sprites are in the incredibly obtuse ETC1 format, which I had never heard of and needed help to identify, and which I will let you just read about yourself.)

The Pokémon models are an absolutely massive chunk of the game data. All the data combined is 3GB; the Pokémon models are a hair under half of that, despite being compressed.

At least this makes them easy to find, since they’re all packed into a single GARC file.

That file contains, I don’t know, a zillion other files. And many of those files are generic containers, containing more files. And none of these files are named. Of course. It’s easy enough to notice that there are nine files per Pokémon, since the sizes follow a rough pattern like the sprites did in Diamond and Pearl. (You’d think that they’d use GARC’s two levels of nesting to group the per-Pokémon files together, but apparently not.)

At this point, I had zero experience with 3D — in fact, working on this was my introduction to 3D and Blender — so I didn’t get very far on my own. I basically had to wait a few years for other people to figure it out, look at their source code, replicate it myself, and then figure out some of the bits they missed. The one thing I did get myself was texture animations, which are generally used to make Pokémon change expressions — last I saw, no one had gotten those ripped, but I managed it. Hooray. I’m sure someone else has done the same by now.

Anyway, I bring up models because of two very weird things that I never would’ve guessed in a million years.

One was the mesh data itself. A mesh is just the description of a 3D model’s shape — its vertices (points), the edges between vertices, and the faces that fill in the space between the edges.

And, well, those are the three parts to a basic mesh. A few very simple model formats are even just those things written out: a list of vertices (defined by three numbers each, x y z), a list of edges (defined by the two vertices they connect), and a list of faces (defined by their edges).

It should be easy to find models by looking for long lists of triplets of numbers — vertex coordinates. Well, not quite. Pokémon models are stored as compiled shaders.

A shader is a simple kind of program that runs directly on a video card, since video cards tend to be a more appropriate place for doing a bunch of graphics-related math. On a desktop or phone or whatever, you’d usually write a shader as text, then compile it when your program/game runs. In fact, you have to do this, since the compilation is different for each kind of video card.

But Pokémon games only have to worry about one video card: the graphics chip in the 3DS. And there’s absolutely no reason to waste time compiling shaders while the game is running, when they could just do it ahead of time and put the compiled shader in the game directly. (Incidentally, the Dolphin emulator recently wrote about how GameCube games do much the same thing.)

So they did that. Thankfully, the compiled shader is much simpler than machine code, and the parts I care about are just the parts that load in the mesh data — which mostly looks like opcodes for “here’s some data”, followed by some data. It would probably be possible to figure out without knowing anything about the particular graphics chip, but if you didn’t know it was supposed to be a shader, you’d be mighty confused by all the mesh data surrounded by weird extra junk that doesn’t look at all like mesh data.

The other was skeletal animation. The basic idea is that you want to make a high-resolution model move around, but it would be a huge pain in the ass to describe the movement of every single vertex. Instead, you make an invisible “skeleton” — a branching tree of bones. The bones tend to follow major parts of the body, so they do look like very simple skeletons, with spines and arms and whatnot (though of course skeletons aren’t limited only to living creatures). Every vertex attaches to one or more of those bones — a rough first pass of this can be done automatically — and then by animating the much simpler skeleton, vertices will all move to match the bones they’re attached to.

The skeleton itself isn’t too surprising. It’s a tree, whatever; we’ve seen one of those already, with the DS directory structure. The skeletons and models are in a neutral pose by default: T for bipeds, simply standing on all fours for quadrupeds, etc. All of this is pretty straightforward.

But then there are the animations themselves.

An animation has some number of keyframes which specify a position, rotation, and size for each bone. Animating the skeleton involves smoothly moving each bone from one keyframe’s position to the next.

Position, rotation, and size each exist in three dimensions, so there are nine possible values for each keyframe. You might expect a set of nine values, then, times the number of keyframes, times the number of bones.

But no! These animations are stored the other way: each of those nine values is animated separately per bone. Also, each of those nine values can have a different number of keyframes, even for the same bone. Also, each of those nine values is optional, and if it’s not animated then its keyframes are simply missing, and there’s a set of bitflags indicating which values are present.

Okay, well, you might at least expect that a single value’s keyframes are given by a list of numbers, right?

Not quite! Such a set of keyframes has an initial “scale” and “offset”, given as single-precision floating point numbers (which are fairly accurate). Each keyframe then gives a “value” as an integer, which is actually the numerator of a fraction whose denominator is 65,535. So the actual value of each keyframe is:

    offset + scale * value / 65535

Maybe this is a more common scheme than I suspect. Animation does take up an awful lot of space, and this isn’t an entirely unreasonable way to squish it down. The fraction thing is just incredibly goofy at first blush. I have no idea how anyone figured out what was going on there. (It’s used for texture animation, too.)
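For a concrete picture, here’s a sketch of decoding one track of values under that scheme; the field order and exact layout are an assumption for illustration, not the actual file format:

    import struct

    def decode_track(data, count):
        # Assumed layout: f32 scale, f32 offset, then `count` u16 numerators.
        scale, offset = struct.unpack_from('<ff', data, 0)
        raw = struct.unpack_from('<%dH' % count, data, 8)
        return [offset + scale * value / 65535 for value in raw]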

Anyway, thanks mostly to other people’s hard work, I managed to write a script that can dump models and then play them with a pretty decent in-browser model viewer. I never got around to finishing it, which is a shame, because it took so much effort and it’s so close to being really good. (My local copy has texture animation mostly working; the online version doesn’t yet.)

Hopefully I will someday, because I think this is pretty dang cool, and there’s a lot of interesting stuff that could be done with it. (For example, one request: applying one Pokémon’s animations to another Pokémon’s model. Hm.)

The one thing that really haunts me about it is the outline effect. It’s not actually the effect from the games; I had to approximate it, and there are a few obvious ways it falls flat. I would love to exactly emulate what the games do, but I just don’t know what that is. But maybe… maybe there’s a chance I can find the compiled shader and figure it out. Maybe. Somehow.

Let’s finish up with some small bumps in the road that are still fresh in my mind.

TMs are still in the Pokémon data, as is compatibility with move tutors. Alas, the lists of which moves the TMs and tutors actually teach are embedded in the code, just like in the Game Boy days. You don’t really need to know the TM order, since they have a numbering exposed to the player in-game, and TM compatibility is in that same order… but move tutors have no natural ordering, so you have to either match them up by hand or somehow find the list in the binary.

I had a similar problem with incense items, which each affect the breeding outcome for a specific Pokémon. In Ruby and Sapphire, the incense effects were hardcoded. I don’t mean they were a data structure baked into the code; I mean they were actually “if the baby is this species and one parent is holding this incense, do this; otherwise,” etc. I spent a good few hours hunting for something similar in later games, to no avail — I’d searched for every permutation of machine code I could think of and come up with nothing. I was about to give up when someone pointed out to me that incense is now a data structure; it’s just in the one format I’d forgotten to try searching for. Alas.

Moves have a bunch of metadata, like “status effect inflicted” or “min/max number of turns to last”. Trouble is, I’m pretty sure that same information is duplicated within the code for each individual move — most moves get their own code, and there’s no single “generic move” function. Which raises the question… what is this metadata actually used for? Is it useful to expose on veekun? Is it guaranteed to be correct? I already know that some of it is a little misleading; for example, Tri Attack is listed as inflicting some mystery one-off status effect, because the format doesn’t allow expressing what it actually does (inflict one of burn, freezing, or paralysis at random).

Items have a similar problem: they get a bunch of their own data, but it’s not entirely clear what most of it is used for. It’s not even straightforward to identify how the game decides which items go in which pocket.

Moves also have flags, and it took some effort to figure out what each of them meant. Sun and Moon added a new flag, and I agonized over it for a while before I was fed the answer: it’s an obscure detail relating to move animations. No idea how anyone figured that out.

In Omega Ruby and Alpha Sapphire, there are two lists of item names. They’re exactly identical, with one exception: in Korean, the items “PP Up” and “PP Max” have their names written with English letters “PP” in one list but with Hangul in the other list. Why? No idea.

Evolution types are numbered. Method 4 is a regular level up; method 5 is a trade; method 21 is levelling up while knowing a specific move, which is given as a parameter. Cool. But there are two oddities. Karrablast and Shelmet only evolve when traded with each other, but the data doesn’t indicate this in any way; they both get the same unique evolution method, but there’s no parameter to indicate what they need to be traded with, as you might expect. Also, Shedinja isn’t listed as an evolution at all, since it’s produced as a side effect of Nincada’s evolution (which is listed as a normal level-up). To my considerable chagrin, that means neither of these cases can be ripped 100% automatically.

Pokémon are listed in a different order, depending on context. Sometimes they’re identified by species, e.g. Deoxys. Sometimes they’re identified by form, e.g. Attack Forme Deoxys. Sometimes they’re identified by species and also a separate per-species form number. Sometimes the numbering includes purely aesthetic forms, like Unown’s, which affect nothing but visuals. But sprites and models both seem to have their own completely separate numberings, which are (of course) baked into the binary.

Incidentally, it turns out that all of the Totem Pokémon in Sun and Moon count as distinct forms! They’re just not obtainable. Do I expose them on veekun, then? I guess so?

Encounters are particularly thorny. The data is simple enough: for each map, there’s a list of Pokémon that can be encountered by various methods (e.g. walking in grass, fishing, surfing). But each of those Pokémon appears at a different rate, and those rates are somewhere in the code, not in the data. And there are some weird cases like swarms, which have special rules. And there are unique encounters that aren’t listed in this data at all, and which veekun has thus never had. And how do you even figure out where a map is anyway, when a named place can span multiple maps, and the encounters are only very slightly different in each map?

Anyway, that’s why veekun is taking so long. Also because I’ve spent several days not working on veekun so I could write this post, which could be much longer but has gone on more than long enough already. I hope some of this was interesting!

Oh, and all my recent code is on the pokedex GitHub. The model extraction stuff isn’t up yet, but it will be… eventually? Next time I work on it, maybe?
