
Descriptions of Objectives and Processes of
Mechanical Learning
¹ Great thanks to my wife for her wholehearted support. Thanks also to the Internet and to those who contribute research content to it.

Abstract

In [1], we introduced mechanical learning and proposed 2 approaches to it. Here, we follow one of these approaches to describe the objects and processes of learning precisely. We discuss 2 kinds of patterns: objective and subjective. Subjective patterns are crucial for a learning machine. We prove that for any objective pattern we can find a proper subjective pattern, based upon the fewest base patterns, that expresses the objective pattern well. The X-form is an algebraic expression for a subjective pattern. The collection of X-forms forms the internal representation space, which is the center of a learning machine. We discuss learning with and without teaching, and we define data sufficiency in terms of X-forms. We then discuss some learning strategies and show that, in each strategy, with sufficient data and certain capabilities, a learning machine can indeed learn any pattern (a universal learning machine). In the appendix, using this view of learning machines, we examine deep learning from a different angle, namely its internal representation space and its learning dynamics.

If you want to know the taste of a pear, you must change the pear by eating it yourself. … All genuine knowledge originates in direct experience.
— Mao Zedong

But, though all our knowledge begins with experience, it by no means follows that all arises out of experience.
— Immanuel Kant

Our problem, … is to explain how the transition is made from a lower level of knowledge to a level that is judged to be higher.
— Jean Piaget

1 Introduction

Mechanical learning is a computing system that is based on a set of simple and fixed rules (hence, mechanical) and that can modify itself according to incoming data (hence, learning). A learning machine M is a system that realizes mechanical learning.

In [1], we introduced mechanical learning and discussed some basic aspects of it. Here, we continue that discussion. As we proposed in [1], there are naturally 2 ways to go: to directly realize one learning machine, or to describe precisely what mechanical learning is really doing. Here, we do not try to design a specific learning machine; instead, we focus on describing mechanical learning, especially the objects and processes of learning, and related properties. Again, the most important assumption is mechanical, i.e., the system must follow a set of simple and fixed rules. By imposing such a requirement on learning, we can go deeper and reveal more interesting properties of learning.

In section 2, we discuss the learning machine further. We show one useful simplification: an N-M learning machine can be reduced to M independent N-1 learning machines. This simplification helps us a lot. We also define the level 1 learning machine in section 2; this concept clears up a lot of confusion.

The driving force of a learning machine M is its incoming data, and incoming data form patterns. Thus, we need to understand patterns first. In section 3, we discuss patterns and examples. In the process of understanding patterns, the question of what is objective and what is subjective naturally arises. In fact, these issues are crucial for a learning machine. Objective patterns and their basic operators are straightforward. In order to understand subjective patterns, we discuss how a learning machine perceives and processes patterns. Such discussions lead us to subjective patterns and their basic operators. We introduce the X-form for subjective expressions, which will play a central role in our later discussions. We prove that for any objective pattern we can find a proper X-form, based upon the fewest base patterns, that expresses the objective pattern well.

Learning by teaching, i.e. learning driven by a well designed teaching sequence (a special kind of data sequence), is a much simpler and more effective form of learning. Though learning by teaching is available only in very rare cases, it is very instructive to discuss it first. This is what we do in section 4. We show that if a learning machine has certain capabilities, we can construct a teaching sequence such that, driven by this sequence, the machine learns effectively. So, with these capabilities, we have a universal learning machine.

From learning by teaching, we gain the insight that the most crucial part of learning is abstraction from lower to higher. We try to apply this insight to learning without teaching. In section 5, we first define mechanical learning without teaching. Then we introduce the internal representation space, which is the center of a learning machine and is best expressed by X-forms. The internal representation space is actually where learning happens. We write down the formulation of learning dynamics, which gives a clear picture of how data drives learning. However, one big issue is how much data is enough to drive the learning to reach the target. With the help of X-forms and sub-forms, we define data sufficiency: sufficient to support an X-form, and sufficient to bound an X-form. Such sufficiency gives a framework for understanding the data used to drive learning. We then show that with a proper learning strategy, sufficient data, and certain learning capabilities, a learning machine indeed can learn. We demonstrate 3 learning strategies: embedding into a parameter space, squeezing to higher abstraction from inside, and squeezing to higher abstraction from inside and outside. We show that the first learning strategy is actually what deep learning uses (see the Appendix for details). And we show that with the other 2 learning strategies and certain learning capabilities, a learning machine can learn any pattern, i.e. it is a universal learning machine. Squeezing to higher abstraction and more generalization is a strategy that we invent here. We believe this strategy would work well for many learning tasks; we need to do more work in this direction.

In section 6, we put down more thoughts about learning machines; we will continue to work in these directions. In section 7, we briefly discuss some issues of designing a learning machine.

In the Appendix, we view deep learning (restricted to stacks of RBMs) from our point of view, i.e. the internal representation space. We start the discussion from the simplest case, the 2-1 RBM, then the 3-1 RBM, N-1 RBM, N-M RBM, stacks of RBMs, and deep learning. In this way, it becomes clear that deep learning is using the learning strategy of embedding a group of X-forms into a parameter space, which we discuss in section 5.

As in [1], and for the same reason, we restrict ourselves here to spatial learning and do not consider temporal learning.

2 Learning Machine

IPU – Information Processing Unit
We discussed mechanical learning in [1]; a learning machine is a concrete realization of mechanical learning. We briefly recall the setup here. See the illustration of an IPU (Information Processing Unit):

Fig. 1. Illustration of N-M IPU (Information Processing Unit)

One N-M IPU has an input space (N bits) and an output space (M bits), and it processes input into output. If the processing adapts according to the input and to feedback on the output, and such adapting is governed by a set of simple and fixed rules, we call the adapting mechanical learning, and such an IPU a learning machine. Notice the phrase "a set of simple and fixed rules". This is a strong restriction; mostly, we use this phrase to rule out human intervention. And, as we pointed out: since the set of adapting rules is fixed, we can reasonably think of the adapting rules as built into the learning machine at setup.

We will try to describe the learning machine precisely. First, we record one simple observation.

Theorem 2.1

One N-M IPU M is equivalent to M independent N-1 IPUs Mi, i=1,…,M.

Proof: The output space of M is M-dimensional, so we write its output as (v1,v2,…,vM). If we project to the first component v1, we get an N-1 IPU, denoted M1. Doing the same for vi, i=2,…,M, gives N-1 IPUs M2,…,MM. This tells us that from one N-M IPU M we get M N-1 IPUs M1,…,MM, so that M = (M1,M2,…,MM).

On the other side, if we have M N-1 IPUs M1,M2,…,MM, we can use them to form an N-M IPU in this way:
M = (M1,M2,…,MM).
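To make the construction concrete, here is a minimal Python sketch (ours, not from [1]); the helper names project and combine are our own:

    from typing import Callable, List, Tuple

    # A processing maps an N-bit input tuple to an M-bit output tuple.
    Processing = Callable[[Tuple[int, ...]], Tuple[int, ...]]
    BitProcessing = Callable[[Tuple[int, ...]], int]

    def project(P: Processing, i: int) -> BitProcessing:
        # Project an N-M processing to its i-th output bit: an N-1 processing.
        return lambda b: P(b)[i]

    def combine(Ps: List[BitProcessing]) -> Processing:
        # Combine M independent N-1 processings into one N-M processing.
        return lambda b: tuple(Pi(b) for Pi in Ps)

    # Example: a 2-2 IPU computing (AND, OR) of its 2 input bits.
    P = lambda b: (b[0] & b[1], b[0] | b[1])
    M1, M2 = project(P, 0), project(P, 1)
    assert combine([M1, M2])((1, 0)) == P((1, 0))  # the 2 sides agree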

Though this theorem is very simple, it makes our discussion much simpler. Most of the time, we need only consider N-1 IPUs, which are much simpler to discuss. However, this concerns only the IPU, i.e. the ability to process information; for learning, we need to consider more. See Theorem 2.4.

The purpose or target of a learning machine. One learning machine is one IPU, i.e. it does information processing: for each input it generates an output, mapping one input (an N-dim binary vector) to an M-dim binary vector. This is what a CPU does as well (more abstractly, since we do not restrict the sizes of N and M, any software without temporal effects can be thought of as one IPU).

However, a learning machine and a CPU have very different goals. A CPU is designed to distinguish an input b∈PS_0^N from any other, even if there is only a one-bit difference, i.e. bit-wise. An IPU and a learning machine are not designed for such a purpose: they are designed to distinguish patterns. A learning machine should generate different outputs for different patterns, but the same output for different inputs of the same pattern. That is to say, the target of a learning machine is to learn to distinguish a group of base patterns and how to process them. Thus, we need to understand patterns. Actually, understanding patterns is the most essential job, which is done in the next section.

Data. The purpose of a learning machine is to learn, i.e. to modify its information processing. However, we emphasize that for mechanical learning, learning is driven by the data fed into it.

Definition 2.2 (Data Sequence)

If we have a sequence Ti, i=1,2,…, with Ti=(bi,oi), where bi is a base pattern and oi is either ∅ (empty) or a binary vector in the output space, we call this sequence a data sequence.

Note, oi could be empty or a vector in the output space. If it is non-empty, it means that at that moment the output should take this value. If it is empty, there is no output data to match up with. A learning machine should be able to learn even when oi is empty; of course, with output values, learning is often easier and faster.
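One possible encoding of a data sequence in Python (our illustration; the type aliases are ours), with None standing for the empty output ∅:

    from typing import List, Optional, Tuple

    BasePattern = Tuple[int, ...]                  # an N-dim binary vector
    DataItem = Tuple[BasePattern, Optional[int]]   # (b_i, o_i); o_i may be None (empty)

    # A short data sequence for a 4-1 learning machine: the first 2 items
    # carry output feedback, the third has an empty output.
    T: List[DataItem] = [
        ((1, 1, 0, 0), 1),
        ((0, 0, 1, 1), 1),
        ((1, 0, 1, 0), None),  # no output to match up; learning must still proceed
    ]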

We can easily see that the data sequence is the only information source for a learning machine to modify itself. Without information from the data sequence, the learning machine has no information about what to modify; it adapts itself based only on information from the data sequence.

There are 2 kinds of data sequences. One is a very well designed data sequence, i.e. we know the consequences of this data and can expect the outcome of learning; this is called a teaching sequence. The other kind of data sequence is not a teaching sequence: such data sequences are just outside data driving the learning machine (possibly random from outside), and we have little knowledge about them. Clearly, in order to learn a certain target, a teaching sequence, if available, is much more efficient. However, in most cases, we simply do not have a teaching sequence.

Universal Learning Machine. Naturally, we ask: what can a learning machine learn? Can it learn anything? To address this, we need some careful definitions. Suppose we have a learning machine M. At the beginning, M has processing P0, i.e. P0 is a mapping from the input space (N-dim) to the output space (M-dim). As learning goes on, the processing changes to P1, which is also a mapping from input space to output space, though a different one. This is exactly what a learning machine does: its processing P adapts. We then have the following definition.

Definition 2.3 (Universal Learning Machine)

For a learning machine M, suppose its current processing is P0, and P1 is another processing. If there is a data sequence T (which depends on P0 and P1) such that, when we apply T to M, at the end the processing of M becomes P1, we say M can learn P1 starting from P0. If for any given processings P0 and P1, M can learn P1 starting from P0, we say M is a universal learning machine.

Simply put, a universal learning machine can learn anything starting from anything. A universal learning machine is desirable. But clearly, not all learning machines are universal. So, we will discuss what properties make a learning machine universal.

In Theorem 2.1, we gave the relationship between N-M IPUs and N-1 IPUs. In order to discuss the relationship between N-M learning machines and N-1 learning machines, we need to introduce one property: standing for zero input. We say a learning machine M has the property of standing for zero input if M does nothing for learning, i.e. does nothing to modify its internal status, when the input is the zero vector (0,0,…,0) and the output-side value is empty. Such a property is very reasonable and very common for a learning machine: after all, zero input means no stimulation from outside, and it is reasonable to require that the learning machine do nothing for such input.

Theorem 2.4

If we have one N-1 universal learning machine S with the property of standing for zero input, we can use M independent copies of S to construct an N-M universal learning machine M.

Proof: For simplicity and without loss of generality, we consider only the case M=2. Now, S is an N-1 universal learning machine. As in Theorem 2.1, we can construct an N-2 IPU M in this way: M = (S1,S2), where S1 and S2 are independent copies of S.

M is surely an N-2 learning machine. We only need to show that it is a universal learning machine. That is to say, for any given processings P0 and P1, there is a data sequence such that, driven by it, M learns P1 from P0.

Actually, we can design a data sequence as follows: T is T1 followed by T2, where T1=(D1,Z1) and T2=(Z2,D2); here D1 is the data sequence that drives S1 to learn its component of P1 from P0, D2 is the data sequence that drives S2 to learn its component, and Z1,Z2 are zero inputs, on which, by the standing-for-zero-input property, the idle copy does nothing. Since S1 and S2 are universal learning machines, D1 and D2 indeed exist. The data sequence T1 followed by T2 is indeed the data sequence we want.

Of course, the data sequence T (T1 followed by T2) is far from optimal and not desirable in practice. But here we just show the existence of such a data sequence.

From Theorems 2.1 and 2.4, we see that, without loss of generality, in many cases we can focus on N-1 learning machines. From now on, we will mostly discuss N-1 learning machines.

Different Levels of Learning. A learning machine modifies its processing according to the data sequence. Obviously, there is some mechanism inside the learning machine that does the learning. More specifically, this learning mechanism catches information embedded in the data sequence and uses it to modify the processing. But we need to be careful to distinguish 2 things: 1) the learning mechanism only modifies the processing, and the learning mechanism itself is not modified; 2) the learning mechanism itself is also modified. How do we describe these things well?

If M is a universal learning machine, then for any given 2 processings P0 and P1, there is a data sequence T such that, starting from P0 and applying T to M, the processing becomes P1. This is clear. But consider this: somehow, we apply some other data sequence so that the processing becomes P0 again. Since M is universal, this is allowed. Now we ask: what if we apply the data sequence T again? What would happen? Does the processing become P1 again? There is no guarantee of this; actually, for many learning machines, this is not the case. However, if it is true, it indicates this: the learning mechanism does not change as the processing changes. This is an important property. We use the next definition to capture it.

Definition 2.5 (Level 1 Learning Machine)

M is a universal learning machine. For any given pair of processings P0 and P1, by definition there is at least one data sequence T such that, starting from P0 and applying T to M, the processing becomes P1. If such a sequence T depends only on P0 and P1, and does not depend on any history of the processing of M, we call M a level 1 universal learning machine.

Note, following this line of thought, we can also define a level 0 learning machine, which is an IPU whose processing cannot be changed. And we can define a level 2 learning machine, whose processing can change and whose learning mechanism can change as well, but whose learning mechanism of the learning mechanism cannot be changed. We could follow this line to define level J learning machines, J=0,1,2,…. But we do not pursue this direction; we will mostly consider level 1 learning machines.

Some Examples
Example 2.1 [Perceptron]
Perhaps the simplest learning machine is the perceptron. A perceptron P with 2 inputs is a 2-1 IPU, and it is a learning machine. However, it is not universal. As is well known, P cannot realize the XOR gate, since XOR is not linearly separable. That is to say, no matter what, P cannot learn this processing.
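A small brute-force illustration of the XOR claim (our sketch, not from the text): over a grid of integer weights and thresholds, no perceptron computes XOR.

    import itertools

    def computes_xor(w1: int, w2: int, t: int) -> bool:
        # True if the perceptron with weights (w1, w2) and threshold t computes XOR.
        return all((w1 * x + w2 * y > t) == bool(x ^ y)
                   for x, y in itertools.product((0, 1), repeat=2))

    found = any(computes_xor(w1, w2, t)
                for w1 in range(-3, 4)
                for w2 in range(-3, 4)
                for t in range(-3, 4))
    print(found)  # False: no such weights on this grid (in fact, none exist at all)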

Example 2.2 [RBM is a learning machine]
See [4] for the RBM. An N-1 RBM is an N-1 IPU, and it is a learning machine as well. There could be many ways to make it learn; the most common is the Gibbs sampling method. We can see this clearly: Gibbs sampling is a simple set of fixed rules, and the processing is modified as data is fed in. However, as we can see in the Appendix, an N-1 RBM is not universal.

Putting M independent N-1 RBMs together as in Theorem 2.1, we get an N-M RBM. So, an N-M RBM is a learning machine, but it is not universal.

Example 2.3 [Deep learning might be a learning machine]
Deep learning is normally a stack of RBMs, see [4]. It is often formed in this way: first use data to train an RBM at each layer, then stack the layers together, then use data to do further training. In the restricted sense, the whole deep learning procedure is not mechanical learning, since it involves a lot of human intervention. But if we consider just the stage after the layers are stacked together, excluding any further human intervention, it is mechanical learning. So, in this sense, deep learning is a learning machine.

Example 2.4 [Deep learning might not be a learning machine]
But these days, deep learning is much more than stacking RBMs together and then training without human intervention. There is a lot of pruning, structure changing, and adjusting done by humans. Such learning is surely not mechanical learning. However, many of its properties can still be studied from the point of view of mechanical learning.

Generally, we can say that for software to do learning, people often need to establish its very complicated structure and initial parameters. This establishment is not simple and fixed. But once the software is established and is running without human intervention, such software is a learning machine.

3 Pattern, Examples, Objective and Subjective

Incoming data drives learning. But an IPU and a learning machine do not treat data bit-wise: they treat data as patterns. So, patterns are very important to a learning machine; everything in a learning machine revolves around patterns. Yet, the notion of pattern is also quite confusing. We can view patterns from different angles and get quite different results. We can view patterns objectively, i.e. totally independent of the learning machine and its learning, and we can view patterns subjectively, i.e. dependent on the learning machine and its view of the pattern. It is very important that we clarify the concepts here.

Examples of Patterns
Before going to more rigorous discussions, we consider some examples of patterns, which will help clarify our thinking. The simplest patterns are 2-dim patterns.

Example 3.1 [All 2-dim Base Patterns]
2-dim patterns are so simple that we can list all base patterns explicitly:

PS_0^2 = {(0,0),(0,1),(1,0),(1,1)}

All base patterns are here: in total, 4 base patterns. For example, (0,1) is a base pattern. But besides base patterns, there are more patterns. How about the statement "the incoming pattern is (0,0) or (0,1)"? Very clearly, what this statement describes is not in PS_0^2. However, equally clearly, the statement is valid and specifies an incoming pattern. We have solid reason to believe that it represents a new pattern that is not in the base pattern space. So, patterns should include "combinations of patterns". We can introduce one way to express this:

p = (0,0)⊔(0,1) = { the pattern where either (0,0) or (0,1) appears }

In the above equation, the symbol ⊔ is called OR (see the similar usage of the symbol in [6]). The combination operator ⊔ makes a new pattern out of 2 base patterns. Clearly, this new pattern is not in the base pattern space. One additional important point: the new pattern p above is independent of any learning machine.
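For such a small space we can enumerate everything directly. A minimal Python sketch (ours), treating the pattern p objectively as a set of base patterns:

    from itertools import product

    N = 2
    PS0 = set(product((0, 1), repeat=N))  # PS_0^2: all 2^N base patterns
    print(sorted(PS0))                    # [(0, 0), (0, 1), (1, 0), (1, 1)]

    # The pattern p = (0,0) ⊔ (0,1): viewed objectively, a set of base patterns.
    p = {(0, 0), (0, 1)}
    assert p < PS0  # p is a pattern, yet a proper subset of the base pattern space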

Example 3.2 [4-dim Patterns]
Although in the above illustrations the patterns are 2-dim, it is easy to see that all such patterns can be represented well in linear vector form (for example, the base pattern in Fig. 2 is (1,1,0,1)). The 4-dim case is still simple enough that we can list the base patterns:

PS_0^4 = {(0,0,0,0),(1,1,0,0),(0,0,1,1),(1,0,1,0),(0,1,0,1),⋯}

A pattern can be shown as a vector or as a 2x2 image. For example, (1,0,1,0) in vector form corresponds to an image of a vertical line. Let's see some examples of combination operators. We can view (1,1,0,0) as a horizontal line and (0,1,0,1) as a vertical line. Consider the statement "one pattern that has this horizontal line and also this vertical line". Clearly, this is a new pattern. We try to capture it as below:

p = (1,1,0,0)⊓(0,1,0,1) = { the pattern where both (1,1,0,0) and (0,1,0,1) appear together }

The symbol ⊓ is called AND (see the similar usage of the symbol in [6]). But what is the new pattern p? The first impression is that it is the base pattern (1,1,0,1) (see Fig. 2). Indeed it is: a new base pattern arising from 2 base patterns. How come? Yet, it could be even more complicated. We will address this later.

Now, we should note that the new pattern p above surely depends on the learning machine and how it views patterns. Without a learning machine and its way of viewing patterns, we could not even talk about "appears together".

Example 3.3 [Operator NOT]
We will see another example of a pattern that is not a base pattern. (1,1,0,0) is a base pattern. How about the statement "the pattern where (1,1,0,0) does not appear"? This is a new pattern as well. We would have:

p = ¬(1,1,0,0) = { the pattern where (1,1,0,0) does not appear }

The symbol ¬ is called NOT (see the similar usage of the symbol in [6]). However, what is the new pattern? Is it a group of base patterns: {(0,0,1,1), (0,0,0,1), ⋯}? Like the last question, this will be addressed later.

Besides the above situations, we can actually see more interesting things (which could not be seen in PS_0^2).

Example 3.4 [Abstraction and Concretization]
Let's see this pattern:

ph = { the common feature of (1,1,0,0) and (0,0,1,1) }

Clearly, this common feature is not in PS_0^4. But it is a very important pattern: it represents the horizontal line. Actually, we can say this pattern ph is the horizontal line. Similarly, we have:

pv = { common feature of (1,0,1,0) and (0,1,0,1) }

This time, pv is the vertical line. Further, we can see:

pl = { the common feature of ph and pv }

This time, pl is the line, vertical or horizontal. From the examples above, we can clearly see that abstracting a common feature out of a group of patterns is a very important operation. Without it, we simply could not see some very crucial patterns (such as the line). Thus, we need to develop symbols for such operations. For example:

pl = α(ph,pv)

Here, α is an operation that abstracts some common feature out of the patterns ph and pv. Note, α is not one operator but one operation: for the same set of patterns, there could be more than one such operation, each abstracting different features from the set. As we meet more complicated patterns later, this property will become clearer.

Very clearly, the operation α depends heavily on the learning machine and on what the learning machine has learned previously.

Conversely to the abstraction operation α, we can also have the concretization operation ρ. See the example below:

ρ is an operation that concretizes a pattern (an abstract pattern) by relating it to some other pattern. Any concretization of a pattern is a pattern. As above, concretizing the horizontal line gives a real horizontal line; and, since it is related to (0,0,0,1), this horizontal line should be (0,0,1,1).

Very clearly, the operations α and ρ depend heavily on the learning machine (on what the learning machine has learned previously, how it views patterns, etc.).

From the above examples, we can see that patterns are much more than base patterns. We can have patterns of patterns (see horizontal lines, vertical lines). We can have patterns of patterns of patterns (see line). We can have operations on patterns and operators of patterns, and all results are still patterns. So, patterns are not just of one type; they have many types. Or, we can say patterns are typeless. Base patterns are just the simplest patterns and the fundamental building blocks.

The 16-dim binary vector space has 2^16 elements. This is a large number; while in theory we could still list all base patterns, it would be very hard.

PS_0^16 = {⋯,(1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0),⋯}

Since the dimension is larger, more phenomena appear. We can see some of them here. Clearly, the binary vector shown in the above equation is one horizontal line (the first row of the 4x4 image). So, we can still have:

ph = α(first 2 horizontal lines)

Clearly, this pattern ph is not in PS_0^16, but it represents the first 2 horizontal lines. Can this pattern ph, which abstracts the first 2 horizontal lines, represent all 4 horizontal lines? This is a very important question. At this moment, we cannot answer it.

Similarly, we have:

pv = α(all vertical lines)

And, we can have:

pl = α(ph,pv)

But again, since we are dealing with a more complicated pattern space now, we can see something that Example 3.2 could not show. How about:

pc = { a point at coordinate (3,3) }, pb = { a point at coordinate (0,0) }

p0 = ρ(pv,pb)

This is the concretization of the vertical line, related to the point (0,0).

And, more:

p = p0⊓pc

This is a pattern with one vertical line and a point at (3,3). The pattern p is the AND of 2 different types of patterns. This is one example of why we have to make all operations and operators on patterns typeless.

Let's put the above equations together; we then have:

p = ρ(α(all vertical lines), { a point at coordinate (0,0) }) ⊓ { a point at coordinate (3,3) }

It might be easier to just state: a vertical line passing through (0,0), and a point at (3,3). But, as we can see, the above equation describes the pattern much more precisely and mechanically (i.e., it avoids language, either natural or programming, using just our simple and mechanical terms: α, ρ, ⊔, ⊓, ¬).

We examined some simple examples above. Though simple, they are very revealing. From these examples, we can see some important properties of patterns. First, patterns are more than base patterns, much more. Second, some patterns together can generate new patterns, and there are many ways to generate them: OR, AND, NOT, abstraction, concretization, and more. Third, very crucially, we realize that some patterns are independent of the learning machine, while some depend on it heavily. In other words, for a learning machine, some patterns are objective, while some are subjective.

Pattern, Objectively. First, we want to discuss patterns that are objective to a learning machine. Base patterns are the foundation of all patterns. We defined them before, but repeat the definition here for ease of reference.

Definition 3.1 (Base Pattern Space)

Each element p∈PS_0^N is a base pattern. There are in total 2^N base patterns in PS_0^N. When N is not very small, PS_0^N is a huge set. Actually, this hugeness is the source of the richness of the world and the fundamental reason for the difficulty of learning.

The base pattern space is just the starting point of our discussion. From the above examples, we know that many patterns are not base patterns. But if a pattern is not a base pattern, what is it? We can see it from this angle: no matter what a pattern is, what is presented to the input space of a learning machine is a base pattern. So, naturally, we have the definition below.

Definition 3.2 (Pattern as Set of Base Patterns)

An N-dim pattern p is a set of base patterns:

p = { b1,b2,⋯ | bi∈PS_0^N }

We denote this set as pb and call it the base set of p (b stands for base). While we use p as the notation for a pattern, we understand it as a set of base patterns; if we want to emphasize that it is a set of base patterns, we use the notation pb. We can also write p=pb. Any base pattern in the base set is called a base face of p (or simply a face). For example, above, b2 is one face of p. In particular, any base pattern b is itself a pattern, and it is the (only) base face of itself.

According to this definition, a pattern is independent of the learning machine: it is just a group of base patterns, no matter what the learning machine is. If we want to view patterns objectively, the only way is to define a pattern as a group of base patterns. So, objectively, a pattern is a set of base patterns.

What are the objective operators on objective patterns? Since patterns are sets of base patterns, we naturally first examine the basic set operations: union, intersection, and complement.

Definition 3.3 (Operator OR (set union))

For any 2 patterns p1 and p2, we have a new pattern p:

p = p1 OR p2 = p1b∪p2b

Here, ∪ is set union. That is to say, the new pattern is the pattern whose base set is the union of the base sets of the 2 old patterns. Or, we can say, p is the pattern whose face is either a face of p1 or a face of p2.

Definition 3.4 (Operator AND (set intersection))

For any 2 patterns p1 and p2, we define a new pattern:

p = p1 AND p2 = p1b∩p2b

Here, ∩ is set intersection. Or we can say, p is the pattern whose face is both a face of p1 and a face of p2. In this sense, we say p is both p1 and p2.

Definition 3.5 (Operator NOT (set complement))

For any pattern p, we define a new pattern:

q = NOT p = (pb)^c = { b∈PS_0^N | b∉pb }

Here, A^c is the complement of set A. That is to say, q is the pattern whose faces are exactly those base patterns that are not faces of p.

Very clearly, the above 3 operators do not depend on the learning machine, so they are all objective. Consequently, if we apply these 3 operators consecutively any number of times, we still generate an objective pattern.
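Since the 3 objective operators are plain set operations, they are straightforward to realize. A minimal Python sketch (ours; the names OR, AND, NOT mirror the definitions above):

    from itertools import product

    N = 4
    PS0 = frozenset(product((0, 1), repeat=N))  # PS_0^4

    def OR(p1, p2):
        return p1 | p2      # set union of the base sets

    def AND(p1, p2):
        return p1 & p2      # set intersection of the base sets

    def NOT(p):
        return PS0 - p      # set complement within PS_0^N

    p1 = frozenset({(1, 1, 0, 0)})
    p2 = frozenset({(1, 1, 0, 0), (0, 0, 1, 1)})
    assert AND(p1, p2) == p1 and OR(p1, p2) == p2
    assert NOT(NOT(p1)) == p1   # objective NOT is an involution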

Pattern, Subjectively. Now we turn attention to subjective patterns, i.e. patterns as viewed by a particular learning machine.

We need to go back for a while and consider basics. When we say there is an incoming pattern p to a learning machine, what do we mean? Objectively, the meaning is clear: at the input space, a binary vector is presented, which is a face of the incoming pattern p. This does not depend on the learning machine at all, and there is no ambiguity.

However, as our examples demonstrated, we have to consider patterns subjectively. We need to go slowly, since there is a lot of potential confusion here. We have to consider things that, objectively, are not valid at all.

Pattern, 1-significant or 0-significant
First, when we discuss patterns subjectively, we need to know: is 1 significant, or 0, or are both equally significant?

Does this sound wrong? By definition, a base pattern is a binary vector, so of course both 0 and 1 would seem equally significant. Why consider 1-significant or 0-significant? Let's consider one simple example. For 4-dim patterns, p1=(1,1,0,0) is one base pattern, and can be viewed as a horizontal line (see Example 3.2 and Fig. 2). p2=(0,1,0,1) is also a base pattern, and can be viewed as a vertical line. When we say p1 and p2 appear together (or happen together), do we mean the pattern (1,1,0,1), or (0,1,0,0)? The former is 1-significant, the latter 0-significant. So, if we want to use terms such as "2 patterns happen together", it is necessary to distinguish 1-significant from 0-significant.

So, distinguishing 1-significant from 0-significant patterns indeed makes sense, and is necessary. When we consider a pattern as 1-significant, we look at its 1 components and do not pay much attention to its 0 components, just as in the example: "(1,1,0,1) equals (1,1,0,0) and (0,1,0,1) appearing together". By contrast, we do not think "(0,1,0,0) equals (1,1,0,0) and (0,1,0,1) appearing together", since we do not consider 0-significant.

In addition to the above consideration, most patterns that people consider in applications are sparse, i.e. only a few bits in the pattern are 1 and most are 0. For sparse patterns, 1-significance is the very natural choice.

From now on, unless we state explicitly otherwise, we will take patterns as 1-significant.

Patterns and Learning Machine. From the examples, we know that one pattern p could be perceived very differently by different learning machines. This makes us consider the question carefully: from the view of a learning machine, what really is a pattern? We have not yet really addressed this crucial question; we have just let our intuition play in the background. In Example 3.2, when we talked about the ⊓ operator and gave an equation like p = (1,1,0,0)⊓(1,0,1,0), we did not really say what this pattern p is. Now we address it more carefully.

Take a look at this: { the pattern where both (1,1,0,0) and (1,0,1,0) appear together }. To our intuition, this is a right thought. However, if we see things objectively, it is simply wrong: base patterns (1,1,0,0) and (1,0,1,0) cannot appear together. They are different base patterns; at any one time, only one of them can appear. In this sense, "together" cannot happen.

To address this question, we have to go deeper to see what a pattern really is. When we talk about a base pattern, i.e. a binary vector in PS_0^N, there is no ambiguity; everything is very clear. However, base patterns alone are not good enough. With only base patterns, we simply cannot handle most things we want to work with.

At this point, we should be able to realize that a pattern is not only associated with what is presented at the input space (surely a base pattern), but also with how a learning machine perceives the incoming pattern. For example, when the base pattern (1,1,1,0) is at the input, the learning machine could perceive it as just one base pattern, or as the two base patterns (1,1,0,0) and (1,0,1,0) appearing together, or it could perceive much more, and much more complicated structure.

So, naturally, a question arises: can we define patterns without introducing the perception of a learning machine? Yes, this can be done. No matter what a pattern is, when it is sent to a learning machine, it is one base pattern at the input space. In this way, we can surely define a pattern to be a set of base patterns. So, no matter what the learning machine is and how it perceives, a pattern is a set of base patterns. This is just the objective pattern. For example, we can forcefully define { the pattern where both (1,1,0,0) and (1,0,1,0) appear together } as the set of base patterns { (1,1,1,0) }. This is what we did in the section above.

This seems to resolve the ambiguity. However, as all the examples indicated, the objective way cannot go far, and we need to understand patterns subjectively. A pattern cannot be separated from how a learning machine perceives it. A pattern defined as a set of base patterns is precise, but how a learning machine perceives patterns is much more important. Without a learning machine perceiving it, no matter how precise a pattern is, it is not very useful.

Here, it is worth stopping to review our thoughts. The major point is: the learning machine plays an active role, and it must have its own way to see its outside world. More precisely, a learning machine must have the ability to tell what is outside of itself, what is inside of itself, and what its view of the outside is. Having or lacking such an ability is critical. Only with this ability can the learning machine go further and our later discussions be conducted. It is very important that we realize this. Without such an ability, a learning machine is reduced to an ordinary computer program, for which learning is very hard. From now on, our learning machines will have this ability, and we will make the ability clearer. So, patterns will be mainly subjective to a learning machine.

Thus, we have to address this critical issue: how does a learning machine perceive patterns? And we need to see this by considering relationships among patterns. We need to think about these issues as well: 1) how to form new patterns from old patterns? 2) how to associate new patterns with previously learned patterns? 3) how to organize learned patterns? 4) how to re-organize learned patterns? In order to do these, we have to see how the machine perceives.

How a Learning Machine Perceives Patterns. How a learning machine perceives patterns is closely related to how it processes information. So we go back to the IPU for a while. Consider an N-1 IPU M with processing P. We define the black set:

Definition 3.6 (Black Set of an N-1 IPU)

For an N-1 IPU M with processing P, the black set of M is:

B = { b∈PS_0^N | P(b)=1 }

Equivalently, we also call B the black set of the processing P.
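For small N, the black set can be computed by brute force over PS_0^N. A Python sketch (ours):

    from itertools import product

    def black_set(P, N):
        # Black set of an N-1 processing P: all base patterns mapped to 1.
        return {b for b in product((0, 1), repeat=N) if P(b) == 1}

    # Example: a 4-1 processing that is light exactly on the top horizontal line.
    P = lambda b: 1 if b == (1, 1, 0, 0) else 0
    print(black_set(P, 4))  # {(1, 1, 0, 0)}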

For an IPU M with black set B, this means: if we put a base pattern b∈B into the input space, M processes it to 1; if b∉B, to 0. This reveals one important fact: inside M, there must exist a bit pb with the property: if the input b∈B, then pb=1; if b∉B, then pb=0.

We do not know exactly what is inside M, and we do not know how exactly the processing is done. However, we do know such a bit pb must be there. We do not know where this bit pb is or in what form it exists, but we know it exists; otherwise, how could M distinguish inputs from B and not from B? Such a bit pb reflects how M processes input to output. We can imagine that M could have more such bits. So, we have the following definition.

Definition 3.7 (Processing Bits)

For an IPU M, if it has an internal bit pb with this property: there is a set B⊂PS_0^N such that for any b∈B, pb=1 (light), and for any b∉B, pb=0 (dark), then we call the bit pb a processing bit. If M has more than one such bit, say pbj, j=1,…,L, we call this set the set of processing bits of M, or simply the processing bits.

Theorem 3.8

For an IPU M (excluding the 2 extreme cases below), the set of processing bits {pbj | j=1,…,L} exists and is not empty.

Proof: We exclude 2 extreme cases: M maps all inputs to the 0 vector (0,0,…,0), and M maps all inputs to the 1 vector (1,1,…,1). Excluding these, the black set B of M is a proper nonempty subset of PS_0^N, and so is B^c. Thus, as argued above, there must exist a bit pb inside M with the property: for b∈B, pb=1; for b∉B, pb=0. So the set of processing bits indeed exists and is not empty.

In the proof, we showed that the set of processing bits is not empty: there is at least one bit in it. Such a case indeed exists: there are IPUs whose set of processing bits has only one element. But in most cases, the set of processing bits has more than one element. In fact, L, the number of processing bits, reflects the complexity of the IPU. The processing bits reflect how the processing of the IPU is conducted.

Since a learning machine is also an IPU, it has processing bits as well. But, as discussed before, how a learning machine perceives patterns is closely related to how it processes input. So, for a learning machine, we call these bits perception bits instead of processing bits. When a base pattern is put into the input, each perception bit takes its value; all these values together are the perception values. The perception values reflect how a learning machine perceives this particular base pattern. As a learning machine learns, its perception bits can change: the number of perception bits can increase or decrease, and their behavior can change. Even if the array of perception bits does not change, the behavior can change.

Armed with perception bits, we can describe how M perceives patterns. When a base pattern b is put into the input space, the perception bits act: some are light and some are dark. These bits reflect how b is perceived, i.e. the perception bits {pbj | j=1,…,L} take values, giving a binary vector pv=(pv1,pv2,…,pvL), where pvj is the value (1 or 0) that pbj takes. We call these the perception values. Note, the perception values depend on the particular base pattern; they tell how M perceives a base pattern b.
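A small Python sketch of perception values (ours; the 2 perception bits pb_top and pb_left are hypothetical examples for a 2x2 image flattened row by row):

    def perception_values(perception_bits, b):
        # The value each perception bit pb_j takes on base pattern b.
        return tuple(pb(b) for pb in perception_bits)

    # Hypothetical perception bits: "top row fully lit" and "left column fully lit".
    pb_top = lambda b: b[0] & b[1]
    pb_left = lambda b: b[0] & b[2]
    bits = [pb_top, pb_left]

    # (1,1,0,0) and (1,1,1,0) both take value 1 at pb_top, so M could
    # subjectively perceive them as the same pattern at that bit.
    print(perception_values(bits, (1, 1, 0, 0)))  # (1, 0)
    print(perception_values(bits, (1, 1, 1, 0)))  # (1, 1)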

If b and b′ are 2 different base patterns, i.e. they differ bit-wise, but they have the same perception values, then these 2 base patterns are perceived as the same by M, since M has no way to tell any difference between b and b′. If 2 base patterns can possibly be perceived as different by M, their perception values must be different (at least one perception bit must behave differently).

However, the reverse is not true. It is possible that 2 base patterns b and b′ have different perception values, but M still perceives b and b′ as the same subjectively. That is to say, M can perceive 2 different base patterns as the same even when their perception values differ. So we have the definition below.

Definition 3.9 (Possibly Perceived as Same)

Suppose M is a learning machine with perception bits {pbj | j=1,…,L}. If for 2 base patterns b1 and b2 the perception values are (pv11,pv12,…,pv1L) and (pv21,pv22,…,pv2L), and for at least one k, pv1k=pv2k=1, we say that at perception bit pbk, M could subjectively perceive b1 and b2 as the same.

That is to say, for 2 base patterns, if their perception values differ at every perception bit, the learning machine cannot possibly perceive them as the same. But if at least at one perception bit their perception values are both 1, M could possibly perceive them as the same subjectively. Of course, M could also perceive them as different subjectively. Note, the perception value should be 1, not 0; this is related to 1-significance.

Definition 3.10 (Subjective Pattern)

Suppose M is a learning machine with perception bits {pbj | j=1,…,L}. Suppose p is a group of base patterns, and at a perception bit pbk, 1≤k≤L, the perception value of every base pattern in p equals 1. Then M could perceive all base patterns of p as the same, and if so, we say M perceives p as one pattern subjectively at pbk, and p forms a subjective pattern.

Note, the definition only requires that all base patterns in p behave the same at one perception bit. This is the minimal requirement. Of course, the requirement could be strengthened, e.g. to require the same behavior at all perception bits. But all such requirements are subjective.

Here we put down the major points about subjective patterns and how a learning machine perceives them.

There are perception bits in a learning machine (excluding only the 2 extreme cases). Any system that satisfies the definition of a learning machine must have perception bits. How the perception bits are formed and how exactly they are realized inside a learning machine can differ greatly. But we emphasize that perception bits indeed exist.

These bits are very crucial for a learning machine: they reflect how the learning machine perceives and processes patterns. When a base pattern is put into the input space of the learning machine, the perception bits act, and the learning machine uses these values to perceive the pattern subjectively and to process it accordingly.

For a learning machine, the perception bits change with learning. Moreover, even if the number of perception bits does not change, the behavior of the perception bits can change (and so does the perception of the learning machine).

Armed with perception bits, we can understand subjective patterns well. If 2 base patterns behave the same at one perception bit, then the 2 base patterns can be perceived as the same at this perception bit subjectively. This extends to more than 2 base patterns: for a group of base patterns p, if all base patterns behave the same at one perception bit, then p can be perceived as the same at this perception bit subjectively. This is the way to connect the objective and the subjective.

To consider patterns objectively, we only need set operations; no modification of the learning machine itself is required. To consider patterns subjectively, set operations can still be used, but, more importantly, perception bits are needed, and, quite often, modifying the perception bits is necessary. For subjective operators on subjective patterns, we need to base our discussion on perception bits.

Pattern, Subjective Operators. Just as with operators for objective patterns, it is natural to consider subjective operators for subjective patterns. There are 3 very basic operators: NOT, OR, AND. First, consider NOT.

Definition 3.11 (Operator NOT for a Pattern, Subjectively)

Suppose M is a learning machine with perception bits {pbj | j=1,…,L}. For a subjective pattern p perceived at pbk by M, q is another pattern perceived at pbk by M in this way: q consists of all base patterns that are perceived by M and whose perception value at pbk is 0.

We denote this pattern q as q = NOT p, or q=¬p; the notation ¬ follows [5]. We can also say that pattern q is the pattern where p does not appear.

Note, this operation NOT is subjective: q consists of base patterns that are perceived by M. So it is quite different from the objective operation NOT (set complement). Another important point: in order to apply this operator, there is no need to modify the perception bits of M; only the perception value differs.

Now we turn attention to another operator, OR. Consider a subjective pattern p1 with perception values pv11,…, and a subjective pattern p2 with perception values pv21,…. Since p1 and p2 are different patterns, their perception values must differ at some bits. Now we want to put them together to form a new pattern, p = p1 OR p2, which means either p1 or p2. This action of course changes the perception of M, and must change it: if the perception were not changed, there would be no way to have OR. So, when we introduce the OR operator, we in fact change M. This is what subjective really means: the learning machine changes its perception so that p1 and p2 are treated as the same, although p1 and p2 indeed differ, and the difference is ignored.

Definition 3.12 (Operator OR for 2 Patterns, Subjectively)

Suppose M is a learning machine with perception bits {pbj | j=1,…,L}. For any 2 subjective patterns p1 and p2, with p1 perceived at pbk1 by M and p2 perceived at pbk2 by M, p is another subjective pattern, perceived by M in this way: first M modifies its perception bits if necessary, then M perceives any base pattern from either p1 or p2 the same at another perception bit pbl. That is to say, if pbl does not exist, M generates this perception bit first.

We can also say that the new pattern p is where either p1 or p2 appears. We denote this new pattern as p = p1 OR p2 = p1+p2; the notation + follows [3].

Note, to perform the operation OR, we might need to modify the perception bits of M, often by adding a new perception bit. This is totally different from the objective OR (set union). On the surface, p1+p2 is indeed a union (set union) of p1 and p2; but without modification of the perception bits, there is no way to form this union.

Then consider the subjective operator AND. This operator is crucially important; actually, we spent a lot of time above arguing about it, i.e. about "appears together".

Definition 3.13 (Operator AND for 2 Patterns, Subjectively)

Suppose M is a learning machine with perception bits {pbj | j=1,…,L}. If p1 is a subjective pattern perceived at pbk1 and p2 is a subjective pattern perceived at pbk2, then all base patterns that M perceives at both pbk1 and pbk2 at the same time form another subjective pattern p, perceived by M at pbl. That is to say, if pbl does not exist, M generates this perception bit first.

We can also say that the new pattern p is where both p1 and p2 appear together. We denote this pattern as p = p1 AND p2 = p1⋅p2; the notation ⋅ follows [3].

Note, to perform the AND operator, we have to modify the perception bits of M. This is totally different from the objective AND (set intersection).

X-Form. We have set up 3 subjective operators for subjective patterns. Applying the 3 operators consecutively yields an algebraic expression. Of course, in order for this algebraic expression to make sense, the learning machine needs to modify its perception bits. But what can we construct from such algebraic expressions? First, we see some examples.

For example,

E = b1 + b2⋅b3

is one subjective pattern. We can say this pattern is: either b1, or b2 and b3 happening together. However, the expression has more aspects. Since E is an algebraic expression, we can substitute base patterns into it and get a value; this is what algebraic expressions are for. That is to say, E is a mapping from PS_0^N to {0,1}, and it behaves like this: for any b∈PS_0^N, if b=b1 or b is a face of b2⋅b3, then E(b)=1; otherwise E(b)=0. This matches our intuition well.

Example 3.5 [More X-forms]
If g={b1,b2,…,bK} is a group of base patterns, then algebraic expressions over g give more subjective patterns based on g. For example:

b1+b3, b2+(¬b1+b2⋅b5), (b2+(b1⋅b2))⋅(b3+b4), …

are all subjective patterns. These expressions can also be used to define mappings from PS_0^N to {0,1}, just as above.
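Such expressions are easy to realize as executable mappings. A minimal Python sketch (ours); we use the 1-significant reading, which is our assumption here, that a base pattern t "appears" in an input b when every 1-component of t is also 1 in b:

    # Each X-form is a function from a base pattern to {0, 1}.
    def atom(t):
        # 1-significant reading: t appears in b iff every 1 of t is a 1 of b.
        return lambda b: int(all(bi >= ti for bi, ti in zip(b, t)))

    def OR(e1, e2):    # p1 + p2: either appears
        return lambda b: e1(b) | e2(b)

    def AND(e1, e2):   # p1 · p2: both appear together
        return lambda b: e1(b) & e2(b)

    def NOT(e):        # ¬p: p does not appear
        return lambda b: 1 - e(b)

    b1, b2, b3 = (1, 1, 0, 0), (0, 1, 0, 1), (1, 0, 1, 0)
    E = OR(atom(b1), AND(atom(b2), atom(b3)))  # E = b1 + b2·b3
    print(E((1, 1, 0, 0)))  # 1: b1 appears
    print(E((1, 1, 1, 1)))  # 1: b2 and b3 appear together
    print(E((0, 0, 0, 1)))  # 0: neither case holds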

Example 3.6 [Prohibition]
If e1 and e2 are expressions, we want an expression for this situation: e2 prohibits e1, i.e. if e2 is light, the output has to be dark; otherwise, the output equals e1. This expression is:

e1⋅¬e2

which is a subjective pattern as well.

Above, each expression has 2 faces: first, it is an algebraic expression; second, it is a subjective pattern perceived by M. In order for these expressions to make sense, M has to modify its perception bits accordingly. This is crucial. Thus, we have the following definition.

Definition 3.14 (X-Form for patterns)

If E is an algebraic expression in the 3 subjective operators and g={b1,b2,…,bK} is a group of base patterns, we call the expression E(g)=E(b1,b2,…,bK) an X-form upon g, or simply an X-form. We note that, in order for this expression to make sense, the learning machine M quite often needs to modify its perception bits accordingly. And if the expression makes sense, we then have a subjective pattern p=E(g)=E(b1,b2,…,bK).

The name X-form is chosen for a reason: these expressions are forms, and we do not know them well, so X stands for unknown. In [3], there is a similar form called conjunctive normal form (CNF). Still, our expressions are quite different from Valiant's CNF: Valiant's CNF is basically objective, while X-forms are subjective.

One important aspect of an X-form is that it is an algebraic expression, so we can substitute variables in and calculate an output value, 0 or 1 (see the examples above). In this sense, an X-form is a mapping from PS_0^N to {0,1}. The calculation of this expression is actually the same as the learning machine doing processing inside itself. This is a wonderful property, and exactly the reason why we introduce the construction of X-forms. In this way, an X-form can be thought of as a processing. Thus, we can also think of an X-form as having a black set, which exactly equals the subjective pattern of the X-form.

In order to connect objective patterns, subjective patterns, and X-forms, we have the following theorem.

Theorem 3.15

Suppose M is an N-1 learning machine. For any objective pattern po (i.e. a set of base patterns in PS_0^N), we can find some algebraic expression E upon some group of base patterns g={b1,b2,…,bK} so that po=E(g). If so, we say the X-form E(g) expresses po. In most cases, there are many X-forms that express po. However, among those X-forms, we can find at least one based upon no more than N base patterns, i.e. with g={b1,b2,…,bK}, K≤N.

Proof: Suppose po is an objective pattern. It is easy to see that some algebraic expression E expresses po. Since po is a set of base patterns, we can surely write po as:

po = {b1,b2,…,bL}, bi∈PS_0^N

where each bi is a base pattern. The algebraic expression

E1(b1,b2,…,bL) = b1+b2+…+bL

expresses po, since we can easily see po=E1(b1,b2,…,bL). If L is not bigger than N, we have already found such a group of base patterns and such an algebraic expression, and the proof is done.

If L is bigger than N, we can do more. For one base pattern b, we can find some other base patterns b′1,b′2,…,b′J, J≤N, and express b as b=b′1⋅b′2⋅…⋅b′J. Such b′1,b′2,… can surely be found. For example, if b=(1,1,0,…,0), we can take b′1=(1,0,…,0) and b′2=(0,1,0,…,0); then b=b′1⋅b′2.

For a group of base patterns, we can do the same. That is to say, for b1,b2,…,bL, there are at most N base patterns b′1,b′2,…,b′K, K≤N, such that for each bj, j=1,2,…,L, we can find some b′j1,…,b′jKj, Kj≤K, with bj=b′j1⋅…⋅b′jKj. Such a group of base patterns indeed exists; for example, b′1=(1,0,…,0,0),…,b′N=(0,0,…,0,1) are such a group. Substituting these products into E1 gives the expression E = (b′11⋅…⋅b′1K1)+…+(b′L1⋅…⋅b′LKL).

This algebraic expression E and the group of base patterns b′1,b′2,…,b′K, K≤N, are what we are looking for. We should note that E is one level deeper than E1: E1 is a plain sum, while E is a sum of products. We can construct expressions of even higher level.

Of course, the expression in the proof is just used for the existence proof. It is not the best expression; it is very "shallow". We can push the expression to a higher level, but we do not discuss how to do so here.
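A small Python sketch of the construction in the proof (ours): every face of po is rewritten as a product of elementary base patterns, and the faces are summed. The output here is just the symbolic expression:

    N = 4
    # The elementary base patterns b'_1, ..., b'_N.
    elementary = [tuple(1 if j == i else 0 for j in range(N)) for i in range(N)]

    def xform_of(po):
        # Each face b becomes the product of the elementary patterns at its 1-bits;
        # the faces are then summed (operator +), giving the level 2 expression.
        # (The all-zero face would need special handling; we ignore it here.)
        terms = []
        for b in sorted(po):
            factors = ["b'%d" % (i + 1) for i, bit in enumerate(b) if bit == 1]
            terms.append("⋅".join(factors))
        return " + ".join(terms)

    po = {(1, 1, 0, 0), (0, 1, 0, 1)}
    print(xform_of(po))  # b'2⋅b'4 + b'1⋅b'2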

Theorem 3.15 tells the relationship between objective patterns and subjective patterns. For any objective pattern po, we can find a good group of base patterns (of size as small as possible, at worst not greater than N) and a good algebraic expression, to express this objective pattern as one subjective pattern.

Here is the major point. An objective pattern po is a set of base patterns. However, when po is perceived by a learning machine, the learning machine generates a subjective pattern. The major question is: will the subjective one match the objective one? Theorem 3.15 confirms that, yes, for any objective pattern po, we can always find an X-form that expresses po.

Naturally, we next ask: how good is such an expression? For "how good", we need criteria, and there could be many. One criterion, however, is very important: use as few base patterns as possible, i.e. in po=E(g)=E(b1,b2,…,bK), make K as small as possible. There could be other important properties of X-forms; satisfying them yields a better X-form.

Of course, the next question is how to actually find or construct such an X-form. That is what we do next.

Sub-Form. Several X-forms could form a new X-form, and some parts of an X-form are also X-forms. Such parts can be quite useful, so we discuss sub-forms here.

Definition 3.16 (Sub-Form of a X-form)

Suppose e is an X-form, i.e. an algebraic expression E (in the 3 subjective operators) upon a set of base patterns g={b1,b2,…,bK}, with e=E(g)=E(b1,b2,…,bK). A sub-form of e is an algebraic expression Es upon a subset gs={bs1,…,bsJ}⊆g, J≤K, such that es=Es(gs)=Es(bs1,…,bsJ) and the objective pattern expressed by es is a proper subset of the objective pattern expressed by e.

So, by definition, a sub-form is also a X-form.

Example 3.7 [Sub-Form]
1. e=b1+b2 is an X-form. Both b1 and b2 are sub-forms of e.
2. e=b1+b2⋅b3 is an X-form. Both b1 and b2⋅b3 are sub-forms of e. But b2 (or b3) alone is not a sub-form of e.
3. e=(b1+b2)⋅(b1+b3) is an X-form. We can see that the black set of e is {b1, b1⋅b2, b1⋅b3, b2⋅b3}. So b1 is a sub-form of e, but b2 and b3 are not.

One X-form e could have more than one sub-form, or no sub-form at all. A sub-form, being itself an X-form, could have its own sub-forms; so we can have sub-forms of sub-forms, and so on. It is easy to see that any sub-form of a sub-form is still a sub-form. So, an X-form could have many sub-forms; we denote them all as ei, i=1,2,…,L. These sub-forms play important roles: they are actually the fabric of processing.
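We can check sub-forms mechanically by comparing black sets, reusing the 1-significant evaluation above (our sketch; case 3 of Example 3.7):

    from itertools import product

    def atom(t):     return lambda b: int(all(x >= y for x, y in zip(b, t)))
    def OR(e1, e2):  return lambda b: e1(b) | e2(b)
    def AND(e1, e2): return lambda b: e1(b) & e2(b)

    def black(e, N=4):
        # Black set of an X-form: all base patterns it maps to 1.
        return {b for b in product((0, 1), repeat=N) if e(b) == 1}

    b1, b2, b3 = (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)
    e = AND(OR(atom(b1), atom(b2)), OR(atom(b1), atom(b3)))  # (b1+b2)·(b1+b3)

    # b1 is a sub-form: its objective pattern sits strictly inside that of e.
    assert black(atom(b1)) < black(e)
    # b2 alone is not: it is light on inputs where e is dark, e.g. (0,1,0,0).
    assert not black(atom(b2)) <= black(e)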

4 Learning by Teaching

We now turn attention to learning. We emphasize again that a learning machine is based on patterns, not on bits, and the purpose of a learning machine is to process patterns and to learn how to process them.

Theorems 2.1 and 2.4 tell us that, for simplicity and without loss of generality, we can consider just N-1 learning machines. For an N-1 learning machine, its processing is equivalent to its black set. We can also consider an objective pattern p, which is a set of base patterns; thus p can be thought of as the black set of one processing, and vice versa. This tells us that for an N-1 learning machine, its processing is equivalent to an objective pattern, called its black pattern. Obviously, black set and black pattern are equivalent, and we switch between the 2 terms freely. With this understanding, we can equivalently define the universal learning machine as below.

Definition 4.1 (Universal Learning Machine (by Black Set))

For an N-1 learning machine M with current black set B, and a given objective pattern p, if M can start from B to learn, and at the end of learning its black set becomes p, we say M can learn from B to p. If for any B and p, M can learn from B to p, we call M a universal N-1 learning machine.

For an N-1 learning machine M, it is easy to see that Definition 4.1 and Definition 2.2 are equivalent.

Now we turn our attention to how to make a learning machine learn from B to p. It is easy to imagine that there are many possible ways to learn. Here, we discuss learning by teaching, that is, we design a special data sequence T and apply it to the learning machine, and the machine learns effectively, driven by T. We call T a teaching sequence; it is a specially designed data sequence.

It is easy to imagine that, if we know the teaching sequence, learning by teaching is easy: just feed the teaching sequence in, and learning is done. This is quite close to programming. But learning by teaching can reveal interesting properties to us, and can guide our further discussions.

Consider a teaching sequence T={(bi,oi)|i=1,2,…}. Here, the output feedback oi could be empty, i.e. there is simply no output feedback; the learning machine can still learn without it. Of course, with output feedback, the learning will be more effective and efficient. The teaching sequence is the only information source for the machine: the learning machine gets no outside information besides the teaching sequence. This is essential.

The fundamental question is: what properties make a learning machine universal? We will reduce this question to certain capabilities of the learning machine; with these capabilities, the machine is universal.

Note one special case: the black set of M is the empty set, which we call the empty state. This is a very useful case. Some base patterns are quite unique: b1=(1,0,…,0), b2=(0,1,…,0), …, bN=(0,0,…,1), i.e. base patterns with exactly one component equal to 1 and the rest equal to 0. We call such base patterns elementary base patterns.

Definition 4.2 (Learning by Teaching - Capability 1)

For a learning machine M, capability 1 is: for any elementary base pattern bj, j=1,2,…,N, there is one teaching sequence Tj, j=1,2,…,N, so that, starting from the empty state and driven by Tj, the black pattern becomes bj.

Definition 4.3 (Learning by Teaching - Capability 2)

For a learning machine M, capability 2 is: for any black pattern p, there is at least one teaching sequence T so that, starting from p and driven by T, the black set becomes empty.

Capability 2 means: the machine can forget its current black pattern and go back to the empty state.

Definition 4.4 (Learning by Teaching - Capability 3)

For a learning machine M, capability 3 is: for any 2 objective patterns p1 and p2, there is at least one teaching sequence Td so that, starting from p1 and driven by Td, the black pattern becomes p1⋅p2; there is at least one teaching sequence Tp so that, starting from p1 and driven by Tp, the black pattern becomes p1+p2; and there is at least one teaching sequence Tn so that, starting from p1 and driven by Tn, the black pattern becomes ¬p1.

Simply put, capability 3 means: for any 2 objective patterns p1, p2, the learning machine is capable of learning the subjective pattern obtained by applying the operators "⋅" and "+" to p1 and p2, and "¬" to p1. This is the most crucial capability.

If one learning machine has all 3 capabilities, we expect a strong learning machine. Indeed, we have the following major theorem.

Theorem 4.5

If an N-1 learning machine M has the above 3 capabilities, it is a universal learning machine.

Proof: Since we have capability 2, we only need to consider the case of starting from the empty state. That is, we only need to prove: for any objective pattern p, we can find a teaching sequence T so that, starting from the empty state and driven by T, the black pattern becomes p.

According to Theorem 4, for any objective pattern p, we can find an X-form E(b), where E is one algebraic expression and b is a group of elementary base patterns b={ei|i=1,…,K}, K≤N, so that p equals this X-form, i.e. p=E(e1,e2,…,eK).

By E, we can construct a teaching sequence in this way:
1) First, we have a teaching sequence T1 so that M goes to the empty state. This uses capability 2.
2) Then, we have a teaching sequence T2 so that M has black pattern e1. This uses capability 1.
3) Since E is formed by finitely many steps of ⋅, ¬, and + starting from e1, we can use capability 3 consecutively to construct a teaching sequence T for each operator in E. Eventually, we get a teaching sequence covering all operators in E.
The concatenated teaching sequence T1+T2+T will drive M to p.

Note: The expression E depends on several things: the complexity of p, and on finding an X-form E for p. In Theorem 4, we demonstrated 2-level X-forms. We actually expect to have a much better X-form. The worst case would be E=b1+b2+…, in which the pattern p is so complicated that there is no way to find a higher-level X-form, so the only way is to just list all base patterns.
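The construction in the proof can be sketched in code. In the following minimal Python sketch (ours, not the paper's), the helpers T_empty, T_elem, T_and, T_or, T_not are hypothetical stand-ins for capabilities 2, 1, and 3 respectively; here they only return symbolic tokens, while a concrete machine would supply real teaching sub-sequences.

```python
# Sketch of the proof of Theorem 4.5: build a teaching sequence by walking
# the expression tree of E. The T_* helpers are hypothetical stubs standing
# in for the machine's capabilities; here they just return symbolic tokens.
def T_empty():   return [("forget",)]           # capability 2: go to empty state
def T_elem(j):   return [("teach-elem", j)]     # capability 1: reach e_j from empty
def T_and(rhs):  return [("and", rhs)]          # capability 3: p -> p . rhs
def T_or(rhs):   return [("or", rhs)]           # capability 3: p -> p + rhs
def T_not():     return [("not",)]              # capability 3: p -> ¬p

def teaching_sequence(expr):
    """expr: ('base', j) | ('not', sub) | ('and'|'or', left, right)."""
    def walk(node):
        if node[0] == "base":
            return T_elem(node[1])
        if node[0] == "not":
            return walk(node[1]) + T_not()
        op, left, right = node
        step = T_and(right) if op == "and" else T_or(right)
        return walk(left) + step                # apply operators consecutively
    return T_empty() + walk(expr)               # first forget, then build up

# p = (e1 + e2) . ¬e3, built operator by operator:
T = teaching_sequence(("and", ("or", ("base", 1), ("base", 2)), ("not", ("base", 3))))
print(T)
```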

Corollary 4.5.1

If we have an N-1 learning machine M with the above 3 capabilities, we can use it to build a universal N-M learning machine.

This follows directly from Theorem 4.5 and Theorem 2. From Theorem 4.5 and this corollary, we reduce the task of finding a universal learning machine to finding an N-1 learning machine with the 3 capabilities. Once we can construct an N-1 learning machine with those 3 capabilities, we have a universal learning machine.

Also, it is easy to see that a universal learning machine surely has the 3 capabilities. Thus, the 3 capabilities are the necessary and sufficient conditions for a learning machine to be universal.

But do we have a learning machine with those 3 capabilities? It is up to us to design a concrete learning machine with the 3 capabilities; we will do this elsewhere. In any case, the 3 capabilities give a clear guide for the design of an effective learning machine: the most essential capability of a learning machine is to find a way to move patterns to more highly organized patterns. See the quotation at the front; the most important step is "from a lower level to …… higher". This indeed guides us well.

5 Learning without Teaching Sequence

Learning by teaching is a very special way to drive learning. From the discussions in the last section, we can see clearly that only with full knowledge of the learning machine and the desired pattern could we possibly design a teaching sequence. In this sense, learning by teaching is quite similar to programming: we inject the ability into the machine, rather than the machine learning by itself. Of course, learning by teaching is still a step beyond programming, and it gives us much more power over machines than programming alone.

We focus on an N-1 learning machine M.

Typical Mechanical Learning From examples of mechanical learning, a typical mechanical learning process is as below:

For an N-1 learning machine M, the learning target is often given as an objective pattern po; M is expected to learn, and the learning result is that the black set of M becomes po.

To drive the mechanical learning, a data sequence is fed into M. In learning by teaching, the data sequence is a specially designed teaching sequence. In learning without teaching, typically, the data fed into M are chosen from the target objective pattern po and from pco. In other words, it is sampling po.

The fed-in data will drive learning, i.e. the black set of M changes. Hopefully, at some later moment t, the black set Bt becomes po, or at least approximates po well.

Definition 5.1 (Typical Mechanical Learning)

1. Choose one sampling set Sin⊂po. Normally, Sin is a much smaller set than po; in the extreme case, we could have Sin=po.

2. Choose another sampling set Sout⊂pco, i.e. no member of Sout is in po. Normally, Sout is a much smaller set than pco; in the extreme case, we could have Sout=pco.

3. Use the sampling sets Sin and Sout to form a data sequence. In the data sequence, data are (bi,oi), i=1,2,…: if bi∈Sin, oi is 1 or ∅ (empty); if bi∈Sout, oi is 0 or ∅ (empty).

4. Feed the data sequence into M consecutively. We do not restrict how to feed, how long to feed, how often to feed, how to repeat feeding, which part to feed, etc.

The actions above drive M to learn. As the result of learning, its processing (equivalently, its black set) changes.

Remark: Sout could be empty, i.e. no sampling outside po. Sin is often not empty; however, if Sin is empty, Sout should not be empty. We will discuss this more under Data Sufficiency.
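As a small illustration of Definition 5.1, the following Python sketch (the function name and sample sizes are our own choices) draws Sin from a target pattern po and Sout from its complement, and assembles a labeled data sequence.

```python
# Sketch of Definition 5.1 (assumptions ours): po is a set of base patterns,
# space is PS0N; we sample a small S_in from po and S_out from its complement,
# then interleave labeled data. Feedback may also be left empty (None).
import random
from itertools import product

def make_data_sequence(po, space, n_in=8, n_out=8, feedback=True):
    pco = space - po                          # complement of the target pattern
    s_in = random.sample(sorted(po), min(n_in, len(po)))
    s_out = random.sample(sorted(pco), min(n_out, len(pco)))
    data = [(b, 1 if feedback else None) for b in s_in]
    data += [(b, 0 if feedback else None) for b in s_out]
    random.shuffle(data)                      # the feeding order is unrestricted
    return data

# Tiny usage: N = 3, target pattern = "first component is 1".
space = set(product((0, 1), repeat=3))
po = {b for b in space if b[0] == 1}
print(make_data_sequence(po, space, n_in=2, n_out=2))
```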

For such typical mechanical learning, what happens in the learning process? To address this, we first want to examine the learning machine itself.

Internal Representation Space
For a learning machine M, it has an input space (N-dim binary arrays), an output space (M-dim binary arrays, but here M=1), and something between the input space and output space. This something in between is the major body of a learning machine, and we denote it as E. What is E? We have not discussed it yet. We need to carefully describe E and its essential properties.

At any point of learning, if we stop learning, then M is an IPU, i.e. it has a processing F:PS0N→{0,1} at that moment. So we can say that, at this moment, F uniquely defines the something between input and output; thus, at that moment, we can think of what lies between the input space and output space as F. It is then quite reasonable to define E as the collection of all processings of M. And we will give E a better name: internal representation space.

Definition 5.2 (Internal Representation Space)

For an N-1 learning machine M, the major body of M that lies between the input space and output space is called the internal representation space of M. At any moment, the processing of M is one member of this internal representation space. So, the internal representation space is the collection of all possible processings of M. We denote it as E.

Remark: The number of all possible processings is 2^{2^N} (for example, for N=8 it is already 2^256, roughly 10^77), an extremely huge number for any not-too-small N. But for a particular learning machine, its internal representation space might be limited, not the full space.

For an N-1 learning machine, any processing F is equivalent to its black set B. By Theorem 4, there is at least one X-form (one algebraic expression E and some base patterns b1,b2,…,bK) so that B=E(b1,b2,…,bK). We say that this X-form expresses the processing F. Thus, naturally, we can think of the collection of all X-forms as expressing the internal representation space. We have the following definition.

Definition 5.3 (Internal Representation Space (X-form))

For an N-1 learning machine M, the major body of M that lies between the input space and output space is called the internal representation space of M. At any moment, one X-form expresses the processing of M, and it is one member of this internal representation space. So, the internal representation space is the collection of all possible X-forms. We denote it as EX.

Remark: For one processing (which is equivalent to one black set), there is at least one X-form to express it; quite often, there are many. So, the size of EX is not less than the size of E; in fact, it is much larger. Learning surely aims to get the correct processing; however, seeking a good X-form that expresses the processing is more important. Thus, using Definition 5.3 (all X-forms as the internal representation space) is much better than using Definition 5.2. From now on, we will use Definition 5.3, and we simply denote the internal representation space as E.

Now we can clearly say: learning is a dynamics on the space E, moving from one X-form to another. Or, we can say, learning is a flow on the internal representation space.

One important note: no matter what a learning machine really is, if it satisfies the definition of a learning machine, it must have an internal representation space as defined above. If we concretely design a learning machine, the internal representation space is designed by us explicitly; we know it well and can view its inside directly. If the learning machine is formed in a different way, such as from an RBM (see Appendix), we cannot view the inside directly. But, in theory, the internal representation space indeed exists, and this space, equivalently, consists of a collection of X-forms. Such a space might be limited, i.e. only a part of the collection of all possible X-forms. This is not good, but, unfortunately, many learning machines are just so. However, when we discuss learning machines theoretically, the internal representation space is as in Definition 5.3.

Now we know that learning is a dynamics on the internal representation space, moving from one X-form to another. But how, exactly?

Let's fix some notation. We have a learning machine M, its input space, output space, and its internal representation space E, and a learning method LM. As in Definition 5.1, we also have a target pattern po and a data sequence {(bi,oi)|i=1,2,…}. Also assume the initial internal representation (one X-form) is e0∈E.

Now we start learning. First, one base pattern b1 from the data sequence is fed into the input space, and its feedback value o1 is also fed into the output space (o1 could be ∅ (empty); in that case, there is simply no feed-in to the output space). Driven by this data, the learning method LM moves the internal representation from e0 to e1, which can be written as:

e1=LM(e0,b1,o1)

Here, LM is the learning method. Note that, since the learning is mechanical, it is legitimate to write it in function form (if it were not mechanical, such a function form might not be justifiable). This is just the first step of learning. Next, we have e2=LM(e1,b2,o2)=LM(e0,b1,o1,b2,o2). The process continues: we feed data (b1,o1),(b2,o2),…,(bk,ok) into the input space consecutively, and we have:

ek=LM(e0,b1,o1,b2,o2,…,bk,ok),k=1,2,…

Note that the fed-in data could repeat, i.e. we could have bi=bj while i≠j.
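In code, the dynamics above is simply a fold of LM over the data sequence. A minimal sketch, assuming only that LM is some fixed mechanical update rule:

```python
# Sketch of the learning dynamics e_k = LM(e_{k-1}, b_k, o_k). LM is any
# mechanical learning method: a fixed rule mapping (current X-form, datum)
# to the next X-form; the trajectory is the flow on E driven by the data.
def run_learning(LM, e0, data_sequence):
    e, trajectory = e0, [e0]
    for b, o in data_sequence:                # o may be None (empty feedback)
        e = LM(e, b, o)                       # one step of the dynamics on E
        trajectory.append(e)
    return e, trajectory
```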

With this process, as k increases, the X-form ek keeps changing, and we hope that at some point ek becomes good enough for us. What is good enough? There may be more than one criterion. For example: "ek expresses po", i.e. the black set of ek equals the target pattern po. But it could also be: "ek expresses a good approximation of po". Or, in addition to expressing po, further goals could be imposed, such as ek being based on fewer base patterns, etc.

Yet, how do we know ek will make our hope come true? Several questions immediately pop up:

What is the mechanism of LM that makes ek approach po?

Is the data sequence good enough? How do we know the data sequence is good enough?

We first discuss sufficiency of data, then further discuss the learning mechanism.

Data Sufficiency A learning machine needs data, and data drive learning: more data, more driving. But data are expensive; it would be nice to use less data to do more, if possible. More importantly, we need to understand which data are used for which purpose.

As we already know, learning is actually getting one good X-form. But an X-form is normally quite complicated and quite hard to get. How can a mechanical learning method get it? Mechanical learning is not as smart as a human; it only follows certain simple and fixed rules. In order for mechanical learning to get a complicated X-form, sufficient data are necessary. But what are sufficient data? The good thing is that the X-form itself gives a good description of such data.

We already know that an X-form and all its sub-forms give perception bits. This tells us that an X-form and all its sub-forms describe the structure of the black set. To tell one X-form apart, the least necessary data are 2 items: one in the black set, and one not in the black set. Of course, just 2 data points are not sufficient to describe an X-form. However, what if, for each sub-form, we can find such a pair of data, one in and one out? It turns out that all such pairs form a very good description of the X-form. This is why we have the following definitions.

Definition 5.4 (Data Sufficient to Support an X-form)

Suppose e is an X-form, and suppose all sub-forms of e are e1,e2,…,eL. For a set of base patterns DS⊂PS0N, if for every sub-form ej there is at least one base pattern bj∈DS so that ej(bj)=0 and e(bj)=1, we say the data set DS is sufficient to support the X-form e. That is to say, for each sub-form ej, we have a datum bj that is in the black set of e but not in the black set of ej.

When we sample as in Definition 5.1, if the sampling includes data sufficient to support the X-form e, then the data sequence D={(bi,oi)|i=1,2,…} has this property: for each sub-form ej, j=1,…,L, there is at least one datum (bi,1) in D so that ej(bi)=0. For such a data sequence D, we say the data sequence is sufficient to support the X-form e.

Data sufficient to support means: for each sub-form of an X-form, there is at least one datum telling the learning machine, "this is only a sub-form; it is good, but not good enough". With such information, the learning method can continue learning mechanically.

Data sufficient to support an X-form provide information from inside the X-form. But we also need information from outside. For this, we define data sufficient to bound an X-form. To state this more easily, we first introduce some terms. For 2 X-forms e and f, if for all b∈PS0N, e(b)=1 implies f(b)=1, we say f is over e (equivalently, the black set of f contains the black set of e). For 2 X-forms e and f, if there is b∈PS0N so that e(b)=0 and f(b)=1, we say f is out of the boundary of e (equivalently, the black set of f is not a subset of the black set of e).

Definition 5.5 (Data Sufficient to Bound an X-form)

Suppose e is an X-form, and suppose all sub-forms of e are e1,e2,…,eL. If, for every sub-form ej and every X-form f that is both over ej and out of the boundary of e, there is at least one b∈DS so that e(b)=0 and f(b)=1, we call the data set DS sufficient to bound e.

When we sample as in Definition 5.1, if the sampling includes data sufficient to bound the X-form e, then the data sequence D={(bi,oi)|i=1,2,…} has this property: for each X-form e′ that is both over some ej and out of the boundary of e, there is at least one datum (bi,0) in D so that e′(bi)=1. For such a data sequence D, we say it is sufficient to bound the X-form e.

Data sufficient to bound means: for any X-form that is over a sub-form and out of the boundary of e, there is at least one datum b telling the learning machine, "this X-form is not good; it is out of the boundary". With such information, the learning method can continue learning mechanically.
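Definition 5.5 quantifies over all X-forms f, so it can only be checked by brute force; on a tiny input space, however, every possible black set can be enumerated. A minimal sketch (ours), under the same modeling assumption that an X-form is represented by its black set:

```python
# Brute-force sketch of Definition 5.5 over a tiny space (N = 2), modeling
# each X-form by its black set. For every candidate black set f that is over
# some sub-form and escapes e, DS must contain a witness b with e(b)=0, f(b)=1.
from itertools import chain, combinations, product

SPACE = list(product((0, 1), repeat=2))       # PS0N for N = 2

def all_black_sets():
    return chain.from_iterable(combinations(SPACE, r) for r in range(len(SPACE) + 1))

def bounds(DS, e_black, sub_blacks):
    for f in map(set, all_black_sets()):
        for ej in sub_blacks:
            if ej <= f and not f <= e_black:  # f over ej, and out of boundary of e
                if not any(b in f and b not in e_black for b in DS):
                    return False              # no datum rules this f out
    return True

e_black = {(1, 0), (0, 1)}                    # e = b1 + b2 (elementary patterns)
print(bounds({(0, 0), (1, 1)}, e_black, [{(1, 0)}, {(0, 1)}]))   # True
print(bounds({(1, 1)}, e_black, [{(1, 0)}, {(0, 1)}]))           # False
```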

Examples of Data Sufficient to Support an X-form: 1. e=b1+b2 is one X-form. Its sub-forms are b1 and b2. So, {b1,b2} is a data set sufficient to support e.
2. e=b1⋅b2 is one X-form; e has no sub-form. The data sets {b′}, {b′′}, and {b′′′} are all sufficient to support e.
3. e=b1+(b1⋅b2) is one X-form. Its sub-forms are b1 and b1⋅b2. The data set {b1,b′} is sufficient to support e, and so is {b1,b′′}.
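The support condition of Definition 5.4, in contrast, is directly checkable. The sketch below (ours) models each X-form as a 0/1 predicate on base patterns and verifies Example 1 above:

```python
# Sketch of Definition 5.4: DS supports e if for every sub-form ej there is
# some b in DS with ej(b) = 0 and e(b) = 1. X-forms are modeled here simply
# as 0/1 predicates on base patterns (membership in the black set).
def supports(DS, e, sub_forms):
    return all(any(ej(b) == 0 and e(b) == 1 for b in DS) for ej in sub_forms)

# Example 1: e = b1 + b2, with sub-forms b1 and b2 (N = 2, elementary patterns).
b1 = lambda b: int(b == (1, 0))
b2 = lambda b: int(b == (0, 1))
e  = lambda b: b1(b) | b2(b)
print(supports({(1, 0), (0, 1)}, e, [b1, b2]))   # True: {b1, b2} supports e
print(supports({(1, 0)}, e, [b1, b2]))           # False: no datum escapes b1
```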

Learning Strategies and Learning Methods Again, learning is a dynamics of X-forms, from one X-form to another. X-forms are complicated; how can such a dynamics reach the desired X-form? The dynamics is determined by learning methods and learning strategies. We discussed learning methods above, described by equation (5): a learning method is a set of rules on how to move from one X-form to another. A learning strategy is higher than a learning method; it governs aspects such as: what X-forms to consider? what general approach to take to X-forms? pre-set some X-forms, or build everything from scratch? etc. So we can see that strategy governs method. Also, different strategies work for different kinds of data, and different strategies need different learning capabilities.

We should emphasize here: learning is a complicated matter, and one strategy and one method cannot fit all situations. There must be many strategies and even more methods. We are going to discuss some strategies and methods. Still, there should be some common rules across these strategies and methods.

One very important property of X-forms is: one processing (equivalently, one black set) could be expressed by more than one X-form (normally, many). This property plays a very important role in learning. Let's see one simple example first. Consider a set of base patterns B:

B={b1,b2,…,bK}

B has K base patterns in total. What X-form could express B? The easiest one is:

e=b1+b2+…+bK

Surely e is one X-form expressing B. Now, suppose we can write b3,…,bK as subjective expressions of b1 and b2, as follows:

b3=E3(b1,b2),…,bK=EK(b1,b2)

So, we can further have:

e′=b1+b2+E3(b1,b2)+…+EK(b1,b2)=E′(b1,b2)

We can see that the X-forms e and e′ express the same black set, but the 2 X-forms are very different. In fact, e′ is more complicated than e, with higher structure. But, at the same time, e′ is built upon far fewer base patterns, just b1 and b2, while e is built upon K base patterns.
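A concrete toy instance of e versus e′ (the example is ours): take K=3 with b3 subjectively expressible as b1⋅b2; both X-forms then have the same black set, but e′ is an expression over only 2 base patterns.

```python
# Toy instance of e vs e' (example ours): with b3 = b1 . b2 (bitwise AND),
# e = b1 + b2 + b3 uses 3 base patterns, while e' = b1 + b2 + b1.b2 is an
# expression E'(b1, b2) over only 2 -- yet both have the same black set.
def bit_and(x, y):
    return tuple(a & b for a, b in zip(x, y))

b1, b2 = (1, 1, 0), (0, 1, 1)
b3 = bit_and(b1, b2)                       # b3 = E3(b1, b2) = b1 . b2 = (0, 1, 0)

e_black       = {b1, b2, b3}               # black set of e = b1 + b2 + b3
e_prime_black = {b1, b2, bit_and(b1, b2)}  # black set of e' = E'(b1, b2)
print(e_black == e_prime_black)            # True: same processing, higher structure
```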

This is very crucial: to learn e, we might have to use all base patterns b1,…,bK, while to learn e′, in principle, we might use only the 2 base patterns b1, b2 (just might; we might need more, depending on the learning method). And there is much more to it: e is just a collection of some base patterns, with no relationship between these base patterns found or used, while e′ is built on many relationships between base patterns (subjectively, of course). In this sense, comparing to e, e