
$\begingroup$Are you questioning the math formulae, or the translation between the math formulae and code? I.e. do you want to know why the cost function is expressed as a sum and the gradient calculation is expressed as a matrix multiplication; or do you want to understand why the $y_i\text{log}(a_i)$ becomes Y * np.log(A) whilst $X(A-Y)^T$ becomes np.dot(X, dz.T)?$\endgroup$
– Neil Slater, Aug 22 '17 at 10:07


$\begingroup$Thanks Neil. Sorry for the ambiguity. The second. I understand the math formulae. I just can't get my head around the intuition for the dot multiplication in one, and element wise multiplication in the other$\endgroup$
– GhostRider, Aug 22 '17 at 10:08

3 Answers

In this case, the two math formulae show you the correct type of multiplication:

$y_i$ and $\text{log}(a_i)$ in the cost function are scalar values. Summing those scalar values over the examples does not change this, and you never combine one example's values with another's in the sum. So each element of $y$ only interacts with its matching element in $a$, which is basically the definition of element-wise.
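As a minimal sketch of that pairing (the example values here are mine, not from the course):

import numpy as np

# Hypothetical labels and predicted probabilities for m = 3 examples
y = np.array([1.0, 0.0, 1.0])
a = np.array([0.9, 0.2, 0.7])

terms = y * np.log(a)   # element-wise: each y_i meets only its own log(a_i)
total = np.sum(terms)   # the sum then runs over the examples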

The terms in the gradient calculation are matrices, and if you see two matrices $A$ and $B$ multiplied using notation like $C = AB$, then you can write this out as a more complex sum: $C_{ik} = \sum_j A_{ij}B_{jk}$. It is this inner sum across multiple terms that np.dot is performing.
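Here is a small sketch of that inner sum (the array contents are arbitrary), showing that np.dot and the explicit sum over $j$ agree:

import numpy as np

A = np.arange(6).reshape(2, 3)     # shape (2, 3)
B = np.arange(12).reshape(3, 4)    # shape (3, 4)
C = np.dot(A, B)                   # matrix product, shape (2, 4)

# One entry of C written out as the inner sum over j:
c_00 = sum(A[0, j] * B[j, 0] for j in range(3))
assert C[0, 0] == c_00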

In part your confusion stems from the vectorisation that has been applied to the equations in the course materials, which look forward to more complex scenarios. You could in fact use

cost = -1/m * np.sum( np.multiply(np.log(A), Y) + np.multiply(np.log(1-A), (1-Y)))

or

cost = -1/m * np.sum( np.dot(np.log(A), Y.T) + np.dot(np.log(1-A), (1-Y.T)))

whilst Y and A have shape (1,m), and both should give the same result. NB in the second version the np.sum is just flattening a (1,1) array into a scalar, so you could drop it and put [0,0] on the end instead. However, this does not generalise to other output shapes (n_outputs, m), so the course does not use it.
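A quick sanity check of that equivalence, as a sketch with made-up values and the course's (1,m) shapes:

import numpy as np

m = 5
Y = np.array([[1., 0., 1., 1., 0.]])        # shape (1, m)
A = np.array([[0.9, 0.2, 0.8, 0.6, 0.3]])   # shape (1, m)

cost1 = -1/m * np.sum(np.multiply(np.log(A), Y) + np.multiply(np.log(1-A), (1-Y)))
cost2 = -1/m * np.sum(np.dot(np.log(A), Y.T) + np.dot(np.log(1-A), (1-Y.T)))

# np.dot of (1,m) with (m,1) yields a (1,1) array, so np.sum merely
# flattens a single value and the two costs agree.
assert np.isclose(cost1, cost2)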

$\begingroup$"So each element of y only interacts with its matching element in a, which is basically the definition of element-wise" - incredibly lucid explanation.$\endgroup$
– GhostRider, Aug 22 '17 at 11:46

Are you asking what the difference is between a dot product of two vectors and summing their element-wise product? They are the same. np.sum(X * Y) is np.dot(X, Y). The dot version is generally more efficient and easier to understand.

But in the cost function, $Y$ is a matrix, not a vector. np.dot then computes a matrix product, and the sum of its elements is not the same as the sum of the elements of the element-wise product. (The two multiplications aren't even defined for the same shapes.)

So the answer is that they are different operations doing different things, used in different situations, and the main difference is dealing with vectors versus matrices.
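A short sketch of both cases (the values are arbitrary):

import numpy as np

# 1-D vectors: the two forms agree.
x = np.array([1., 2., 3.])
y = np.array([4., 5., 6.])
assert np.isclose(np.sum(x * y), np.dot(x, y))

# 2-D matrices: np.dot is a matrix product, so summing its entries
# mixes terms across rows and columns and gives a different number.
X = np.ones((2, 2))
Y = np.arange(4.).reshape(2, 2)
print(np.sum(X * Y))         # 6.0, the sum of pairwise products
print(np.sum(np.dot(X, Y)))  # 12.0, the sum of matrix-product entries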

$\begingroup$Thanks. That's not quite what I'm asking. See the alternative code I have for the cost function (last bit of code). This is incorrect, but I'm trying to understand why it's incorrect.$\endgroup$
– GhostRider, Aug 22 '17 at 10:31


$\begingroup$In the OP's case np.sum(a * y) is not going to be the same as np.dot(a, y) because a and y are column vectors shape (m,1), so the dot function will raise an error. I am pretty sure this is all from coursera.org/learn/neural-networks-deep-learning (a course I just looked at recently), because the notation and code is an exact match.$\endgroup$
– Neil Slater, Aug 22 '17 at 10:41
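A minimal illustration of that comment's point (an assumed m = 5, not from the thread):

import numpy as np

a = np.random.rand(5, 1)   # column vector, shape (m, 1)
y = np.random.rand(5, 1)   # column vector, shape (m, 1)

print(np.sum(a * y))       # fine: element-wise product, then sum
try:
    np.dot(a, y)           # (m,1).(m,1): inner dimensions 1 and m don't match
except ValueError as e:
    print(e)               # shapes (5,1) and (5,1) not aligned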

With regard to "In the OP's case np.sum(a * y) is not going to be the same as np.dot(a, y) because a and y are column vectors shape (m,1), so the dot function will raise an error."...

(I don't have enough kudos to comment using the comment button, but I thought I would add...)

If the vectors are row vectors with shape (1,m), a common pattern is that the second operand of the dot function is postfixed with the ".T" operator to transpose it to shape (m,1), so that the dot product works out as (1,m).(m,1). e.g.

np.dot(np.log(1-A), (1-Y).T)

The common value of m enables the dot product (matrix multiplication) to be applied.

Similarly, for column vectors one sees the transpose applied to the first operand, e.g. np.dot(w.T, X), to put the dimension that is >1 in the 'middle'.

The pattern to get a scalar from np.dot is to arrange the two vectors' shapes so that the '1' dimensions are on the 'outside' and the common >1 dimension is on the 'inside':

(1,m).(m,1), or np.dot(V1, V2) where V1 has shape (1,m) and V2 has shape (m,1).
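Putting those shape patterns together in a runnable sketch (the dimension sizes are arbitrary):

import numpy as np

m, n = 4, 3
A = np.random.rand(1, m)   # row vector, shape (1, m)
Y = np.random.rand(1, m)   # row vector, shape (1, m)
w = np.random.rand(n, 1)   # column vector, shape (n, 1)
X = np.random.rand(n, m)   # shape (n, m)

s = np.dot(np.log(1 - A), (1 - Y).T)   # (1,m).(m,1) -> (1,1), a single value
z = np.dot(w.T, X)                     # (1,n).(n,m) -> (1,m), >1 dims in the middle
print(s.shape, z.shape)                # (1, 1) (1, m)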