Writings

Data represented as a dataframe is generally much easier to transform, filter, or write to a target source. In Spark, data loaded or queried from a source automatically arrives as a dataframe.

Here’s an example of loading, querying, and writing data using PySpark and SQL:

The example above works conveniently if you can easily load your data as a dataframe using PySpark’s built-in functions. But sometimes your processed data ends up as a list of Python dictionaries, say when you didn’t go through spark.read and/or session.sql. How can you load that data as a Spark DataFrame in order to take advantage of its capabilities?

In logistic regression, the loss function $\mathcal{L}$ and the cost function $J$ take the forms

$$\mathcal{L}(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right)$$

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\!\left(\hat{y}^{(i)}, y^{(i)}\right)$$

where $\hat{y} = \sigma(w^{T}x + b)$ is the sigmoid of the linear superposition of the features represented in matrix form, with $w$ as the vector of feature weights, and $b$ as the intercept term. The sigmoid of some argument $z$ is defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

To understand why $\mathcal{L}$ and $J$ take such forms, first note that $\hat{y}$ is the probability of the binary classification variable $y$ being a positive example ($y = 1$).
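These formulas are easy to check numerically, for instance with NumPy (the function and variable names below are illustrative, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid of z: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Average cross-entropy cost J(w, b) over m examples.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1},
    w: (n,) weight vector, b: scalar intercept.
    """
    y_hat = sigmoid(X @ w + b)  # predicted probabilities
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

With $w = 0$ and $b = 0$, every $\hat{y}$ is $0.5$, so the cost reduces to $\log 2$ regardless of the labels, which is a quick sanity check.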

It’s a great thing to constantly have goals that require prolonged periods of deep concentration. This is something I always look forward to. Deep work gives us a sense of great accomplishment when we’re finished, along with expanded expertise in the domains we’ve tackled during the process.

But of course, many of us don’t buy into the “delayed gratification” thing, perhaps due to biological and historical reasons. (Sadly, many of us never liked school, and would be happy to never go back again.) Nature, by default, always follows the path of lowest energy and/or least resistance.

I’ve always thought of a bubble as the compounded result of residual greed, built up as the optimism of the many is perpetually validated.

This reminds me of what Warren Buffett tells us to do when we see compelling evidence of an impending bubble:

“Be fearful when others are greedy, and greedy when others are fearful.” – Warren Buffett

But only a few have the discipline to do this, because it’s easy to forget the principle when you’re constantly seduced by social proof: everyone you know, everywhere you look, seems to be getting rich.

Let’s say you have data containing some metrics and their values across an ordered set of dates in a week. Since most screens are longer horizontally than vertically, it’s sometimes better to present data where one metric lies in a row and the dates lie in columns, rather than the usual way around.

The usual way we show tables is like this:

| date       | Visitors | Orders | Revenue | Metric4 | etc. |
|------------|----------|--------|---------|---------|------|
| 2016-02-28 | 1423     | 19     | 900     | …       | …    |
| 2016-02-29 | 1534     | 38     | 2037    | …       | …    |
| 2016-03-01 | 2645     | 57     | 5612    | …       | …    |
| …          | …        | …      | …       | …       | …    |

Because most screens are in landscape mode and because we read from left to right, there are times when it makes sense to pivot the table as follows:

| metric   | 2016-02-28 | 2016-02-29 | 2016-03-01 | … |
|----------|------------|------------|------------|---|
| Visitors | 1423       | 1534       | 2645       | … |
| Orders   | 19         | 38         | 57         | … |
| Revenue  | 900        | 2037       | 5612       | … |
| Metric4  | …          | …          | …          | … |
| Metric5  | …          | …          | …          | … |
| etc.     | …          | …          | …          | … |

This may not be “tidy data” as defined by Hadley Wickham in his excellent paper, but pivoting as such results in easier navigation/scrolling when you have more metrics than dates.
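With the data in a pandas DataFrame (pandas is my choice here, not necessarily what the original tables were built with), this pivot is just an index plus a transpose:

```python
import pandas as pd

# The usual orientation: one row per date, one column per metric.
df = pd.DataFrame({
    "date": ["2016-02-28", "2016-02-29", "2016-03-01"],
    "Visitors": [1423, 1534, 2645],
    "Orders": [19, 38, 57],
    "Revenue": [900, 2037, 5612],
})

# Pivot: dates become columns, metrics become rows.
pivoted = df.set_index("date").T
pivoted.index.name = "metric"
```

Adding a new metric now adds a row, so the table grows downward (where scrolling is natural) rather than off the right edge of the screen.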

Running an online business that’s growing slower than projected is never an ideal scenario. What can tremendously help diagnose the problem is having data and knowing how to gain insights from it. It is only through the collection and analysis of data that you can free yourself from guesswork, start validating assumptions, and gain insight into how you should be operating your business.