Xingchen Yu

What is your research area?

Bayesian statistical modeling. It's a specific way of modeling statistics that incorporates prior knowledge instead of just the data. You don't want to let the data determine everything, so you use a Bayesian way of thinking, which provides a more structured way of determining the outcome. These days, artificial intelligence and machine learning use this kind of ideology. I came here to focus more on the statistical way of thinking, Bayesian modeling, as it relates to machine learning and high performance computing. I need to use a lot of high performance computing to speed up my algorithms and make my algorithms scalable to big data sets.

What are some of the potential applications of this?

For example, currently our application only focuses on analyzing the political science paradigm as left and right. The data set comes from voting in congress, so every session a congress member votes for or against bills and after the entire session we have aggregated a lot of votes and we have the 1 and 0, yes and no, and we also have data for those who did not participate. We want to put each of these individual congress members in their corresponding left and right locations—not the binary left and right, but on a line—you know, who is more extreme than one another? The existing algorithm tries to analyze their political preferences with the Euclidean distance, where you put everyone in a line in terms of their voting only. But this is not appropriate because the original algorithm put those on the extremes as centrist. Political scientists have no idea why this is the result

So it couldn't handle extreme cases. Therefore our model, instead of taking the Euclidean space as a given, proposes a spherical case. The distance between left and right is now is replaced with the arc distance and we are able to put the extremes back to their correct place.

People think that if members of congress vote the same way on a bill, politically speaking, they are agreeing on this bill for the same reason. But actually it's not the case. I'll give you a quick example—after Hurricane Sandy, you have those from the left opposing the disaster relief because they think it is not enough money, but on the right people think it is too much money. So people on opposite extremes voted "no" for the same bill, for very different reasons.

The first thing we did with this model was try to put these people in the place they're supposed to be, and try to understand their political preferences very well instead of using the very simple Euclidean distance.

How is this being used in other fields?

There's an idea called transitivity of preferences: if you have A greater than B, B greater than C, you assume that A is also greater than C. That's not necessarily true in a lot of human behavior, not just in political science but in things like purchasing. This is why we want to adapt our models next to complex things like modeling consumers' preferences, recommending what to buy. With the existing models on sites like Amazon and eBay, when you click on something they learn your behavior over time and they predict what you're going to buy next and make suggestions. That model assumes that everything lies in the Euclidean distance. That might not be true because of the transitivity of preferences. Our next step is to expand our model beyond a political science paradigm and into the recommender system paradigm.

If that's our next step, then high performance computing is a necessity. Because then the data will be 2 million points, 10 million points, and more. How are you going to store all that data? Before 2012, everything could be stored in one computer. But now, there's too much data. If it can't all be stored in one computer, then our algorithm has to change.

One of the most striking examples is in the field of linguistics, where we are able to look at the frequency domain. When someone says a word and it shows up in the frequency data, they were able to look at these frequencies and know exactly which words a person was saying without hearing the audio, just based on tone and frequency. Machine learners back in the day were so impressed by this, and they used that domain-specific knowledge in terms of feature engineering to be able to distinguish between words. These days these features are learned automatically.

Domain-specific knowledge is very important. I think the misconception is that statistics is percentages, a survey, but it's not just limited to that. Survey analysis is powerful, but our field has changed. Now if you write something on a website, there's sentiment behind it. We can analyze your Tweets and Facebook posts to understand your sentiment. We use that to predict stock price, for example. We pull out Tweets that say "apple," then determine if we're talking about real apples or about Apple the company, then predict sentiments based on what people are saying, then predict Apple stock prices.

That's also what statistics can do.

What do you think people misunderstand about statisticians?

Especially after the boom of machine learning and deep learning, people kind of forget how much statistics do. In undergrad we didn't learn a lot of statistics. After 2012 our computer scientists, statisticians, as well as linguists, psychologists—a lot of different disciplines—worked together to tackle problems we didn't think were possible 10 years ago. For example, they improved voice recognition by 20% in 2 years, which is ridiculous. People from all different disciplines contributed their own knowledge.

These days if you consider the skills sets you need to be an artificial intelligence engineer, you will see that statistics, mathematics, and computer science are all necessary. That's because not only do you need your domain-specific knowledge, you need to use every tool out there to do well.

As for misconceptions about statisticians, I think a lot of people don't know what we are doing. If you look at the papers published in machine learning, everyone is doing things very similarly with a slightly different approach. Most of the interesting tasks these days require people to produce more complex outcomes instead of just a 1 or 0 prediction. And deep learning models fail to do that. Statisticians are working together with computer scientists to tackle the most difficult stuff.

Deep learning is already outperforming human beings in a lot of tasks. Common questions are "What is the probability of this failing?" and "What's the potential cost of that?" Bayesian modeling produces a natural framework for answering those.

Computer scientists are very curious. What is AI? What is statistical learning? Most of them want to do statistical science in the future but they don't know exactly what statisticians do. They think AI is only about coding, but that's not all. Coding efficiently is an important technique, but you need a greater understanding to be successful.

What do you like best about UC Santa Cruz?

I like the trees. I feel like I am actually a scholar when I come to campus, walk in the trees, go to the lab. I also like the gym. From the second floor treadmills you can see Monterey Bay when you run. I think the people are very nice—maybe the good weather helps.

What do you want to do next?

I want to have my own company. I feel like I am acquiring the tools. Getting a PhD here does not only teach you the research in your area, it teaches you how to do any research and understand others' research. For example, if someone is developing an algorithm entirely different from yours, after training in this PhD program, you have the capacity to read the paper and code it and make it work the next day. Being a PhD student here has equipped me with those tools. And the proximity to Silicon Valley is unbeatable.