+ /* get_host_and_port may not return a port even when
+ * there is one: In the [host:port]:path case,
+ * get_host_and_port is called with "[host:port]" and
+ * returns "host:port" and NULL.
+ * In that specific case, we still need to split the
+ * port. */
if (!port)
port = get_port(ssh_host);

[PATCH v7 2/9] connect: call get_host_and_port() earlier

Currently, get_host_and_port() is called in git_connect() for the ssh
protocol, and in git_tcp_connect_sock() for the git protocol. Instead
of doing this, just call it from a single place, right after
parse_connect_url(), and pass the host and port separately to
git_*_connect() functions.

We however preserve hostandport, at least for now.

Note that in git_tcp_connect_sock, the port was set to "<none>" in a
case that never can happen, so that code path was removed.

The last uses of the hostandport variable, besides being strdup'ed
before being split into host and port, is to fill the host header in the
git protocol and to test whether to proxy the request.

Instead of relying on parse_connect_url() to return a host:port string
that makes sense there, re-derive one from the host and port variables.
This will allow to refactor parse_connect_url() to return separate host
and port strings.

Now that nothing besides CONNECT_DIAG_URL is using hostandport, we can
have parse_connect_url() itself do the host and port splitting.

This still leaves "user@" part of the host, if there is one, which will
be addressed in a subsequent change. This however does add /some/
handling of the "user@" part of the host, in order to pick the port
properly.

- if (*host && !port) {
- /* get_host_and_port may not return a port even when there is
- * one: In the [host:port]:path case, get_host_and_port is
- * called with "[host:port]" and returns "host:port" and NULL.
- * In that specific case, we still need to split the port.
- * "host:port" may also look like "user@host:port". As the
- * `user` portion tends to be less strict than `host:port`,
- * we first put it out of the equation: since a hostname
- * cannot contain a '@', we start from the last '@' in the
- * string. */
+ if (*host) {
+ /* At this point, the host part may look like user@host:port.
+ * As the `user` portion tends to be less strict than
+ * `host:port`, we first put it out of the equation: since a
+ * hostname cannot contain a '@', we start from the last '@'
+ * in the string. */
char *end_user = strrchr(host, '@');
- port = get_port(end_user ? end_user : host);
+ if (end_user) {
+ *end_user = '\0';
+ user = host;
+ host = end_user + 1;
+ }
}
+ /* get_host_and_port may not have returned a port even when there is
+ * one: In the [host:port]:path case, get_host_and_port is called
+ * with "[host:port]" and returns "host:port" and NULL.
+ * In that specific case, we still need to split the port. */
+ if (!port)
+ port = get_port(host);

[PATCH v7 8/9] connect: actively reject git:// urls with a user part

Currently, urls of the for git://user@host don't work because user@host
is not resolving at the DNS level, but we shouldn't be relying on it
being an invalid host name, and actively reject it for containing a
username in the first place.

+ if (user)
+ die("user@host is not allowed in git:// urls");
+
/* If the host contains a colon (ipv6 address), it needs to
* be enclosed with square brackets. */
if (colon)
--
2.8.3.401.ga81c606.dirty

I don't know.
urls.txt is for Git users, and connect.c is for Git developers.
urls.txt does not mention that Git follows any RFC when parsing the
URLS', it doesn't claim to be 100% compliant.
Even if it makes sense to do so, as many user simply expect Git to accept
RFC compliant URL's, and it makes the development easier, if there is an
already
written specification, that describes all the details.
The parser is not 100% RFC compliant, one example:
- old-style usgage like "git clone [host:222]:~/path/to/repo are supported
To easy the live for developers, it could make sense why the code is as
it is,
simply to avoid that people around the world suddenly run into problems,
when they upgrade the Git version.
So if there is a comment in the code, it could help new developers to
understand
things easier.

>>>> get_host_and_port(&ssh_host, &port);
>>>> + /* get_host_and_port may not return a port
>>>> even when
>>>> + * there is one: In the [host:port]:path case,
>>>> + * get_host_and_port is called with "[host:port]" and
>>>> + * returns "host:port" and NULL.
>>>> + * In that specific case, we still need to split the
>>>> + * port. */
>>> Is it worth to mention that this case is "still supported legacy" ?
>> If it's worth mentioning anywhere, it seems to me it would start with
>> urls.txt?
>>
>> Mike
>>
> I don't know.
> urls.txt is for Git users, and connect.c is for Git developers.
> urls.txt does not mention that Git follows any RFC when parsing the
> URLS', it doesn't claim to be 100% compliant.
> Even if it makes sense to do so, as many user simply expect Git to accept
> RFC compliant URL's, and it makes the development easier, if there is
> an already
> written specification, that describes all the details.
> The parser is not 100% RFC compliant, one example:
> - old-style usgage like "git clone [host:222]:~/path/to/repo are supported

Is it an option to fix get_host_and_port() so that it returns what
the caller expects even when it is given "[host:port]"? When the
caller passes other forms like "host:port", it expects to get "host"
and "port" parsed out into two variables. Why can't the caller
expect to see the same happen when feeding "[host:port]" to the
function?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]More majordomo info at http://vger.kernel.org/majordomo-info.html

> Torsten Bögershausen <[hidden email]> writes:
>
> >>>> get_host_and_port(&ssh_host, &port);
> >>>> + /* get_host_and_port may not return a port
> >>>> even when
> >>>> + * there is one: In the [host:port]:path case,
> >>>> + * get_host_and_port is called with "[host:port]" and
> >>>> + * returns "host:port" and NULL.
> >>>> + * In that specific case, we still need to split the
> >>>> + * port. */
> >>> Is it worth to mention that this case is "still supported legacy" ?
> >> If it's worth mentioning anywhere, it seems to me it would start with
> >> urls.txt?
> >>
> >> Mike
> >>
> > I don't know.
> > urls.txt is for Git users, and connect.c is for Git developers.
> > urls.txt does not mention that Git follows any RFC when parsing the
> > URLS', it doesn't claim to be 100% compliant.
> > Even if it makes sense to do so, as many user simply expect Git to accept
> > RFC compliant URL's, and it makes the development easier, if there is
> > an already
> > written specification, that describes all the details.
> > The parser is not 100% RFC compliant, one example:
> > - old-style usgage like "git clone [host:222]:~/path/to/repo are supported

"This parser handles the urls described in urls.txt" is about as much as
there is to say IMHO.

> Is it an option to fix get_host_and_port() so that it returns what
> the caller expects even when it is given "[host:port]"? When the
> caller passes other forms like "host:port", it expects to get "host"
> and "port" parsed out into two variables. Why can't the caller
> expect to see the same happen when feeding "[host:port]" to the
> function?

After this series, there's only one use of get_host_and_port(). As well
as one use of host_end() and get_port(). They become implementation
details of the parse_connect_url function. And as I mentioned in the
cover letter this round, parse_connect_url could be further modified to
avoid the kind of back-and-forth that's going on between these
functions. But I'd rather leave that for later.

> Torsten Bögershausen <[hidden email]> writes:
>
>>>>> get_host_and_port(&ssh_host, &port);
>>>>> + /* get_host_and_port may not return a port
>>>>> even when
>>>>> + * there is one: In the [host:port]:path case,
>>>>> + * get_host_and_port is called with "[host:port]" and
>>>>> + * returns "host:port" and NULL.
>>>>> + * In that specific case, we still need to split the
>>>>> + * port. */
>>>> Is it worth to mention that this case is "still supported legacy" ?
>>> If it's worth mentioning anywhere, it seems to me it would start with
>>> urls.txt?
>>>
>>> Mike
>>>
>> I don't know.
>> urls.txt is for Git users, and connect.c is for Git developers.
>> urls.txt does not mention that Git follows any RFC when parsing the
>> URLS', it doesn't claim to be 100% compliant.
>> Even if it makes sense to do so, as many user simply expect Git to accept
>> RFC compliant URL's, and it makes the development easier, if there is
>> an already
>> written specification, that describes all the details.
>> The parser is not 100% RFC compliant, one example:
>> - old-style usgage like "git clone [host:222]:~/path/to/repo are supported
> Is it an option to fix get_host_and_port() so that it returns what
> the caller expects even when it is given "[host:port]"? When the
> caller passes other forms like "host:port", it expects to get "host"
> and "port" parsed out into two variables. Why can't the caller
> expect to see the same happen when feeding "[host:port]" to the
> function?
>

This is somewhat out of my head:
git clone git://[example.com:123]:/test #illegal, malformated URL
git clone [example.com:123]:/test #scp-like URL, legacy
git clone ssh://[example.com:123]:/test #illegal, but supported
as legacy, because
git clone ssh://[user@::1]/test # was the only
way to specify a user name at a literal IPv6 address

May be we should have some more test cases for malformated git:// URLs?

Note in the first case, hostandport is "[example.com:123]:", and that
is treated as host=example.com:123 and port=NULL further down, so my
series is changing something here, but arguably, it makes it less worse.
(note that both with and without my series,
"git://[example.com:123]:42/path" is treated the same, so only a corner
case changed)

Can we go forward with the current series (modulo the comment style
thing Eric noted, and maybe adding a note about the parser handling urls
as per urls.txt), and not bloat scope it? If anything, the state of the
code after the series should make further parser changes easier.

> On Tue, May 24, 2016 at 06:44:26AM +0200, Torsten Bögershausen wrote:
>> On 05/23/2016 11:30 PM, Junio C Hamano wrote:
>>> Torsten Bögershausen <[hidden email]> writes:
>>>
>>>>>>> get_host_and_port(&ssh_host, &port);
>>>>>>> + /* get_host_and_port may not return a port
>>>>>>> even when
>>>>>>> + * there is one: In the [host:port]:path case,
>>>>>>> + * get_host_and_port is called with "[host:port]" and
>>>>>>> + * returns "host:port" and NULL.
>>>>>>> + * In that specific case, we still need to split the
>>>>>>> + * port. */
>>>>>> Is it worth to mention that this case is "still supported legacy" ?
>>>>> If it's worth mentioning anywhere, it seems to me it would start with
>>>>> urls.txt?
>>>>>
>>>>> Mike
>>>>>
>>>> I don't know.
>>>> urls.txt is for Git users, and connect.c is for Git developers.
>>>> urls.txt does not mention that Git follows any RFC when parsing the
>>>> URLS', it doesn't claim to be 100% compliant.
>>>> Even if it makes sense to do so, as many user simply expect Git to accept
>>>> RFC compliant URL's, and it makes the development easier, if there is
>>>> an already
>>>> written specification, that describes all the details.
>>>> The parser is not 100% RFC compliant, one example:
>>>> - old-style usgage like "git clone [host:222]:~/path/to/repo are supported
>>> Is it an option to fix get_host_and_port() so that it returns what
>>> the caller expects even when it is given "[host:port]"? When the
>>> caller passes other forms like "host:port", it expects to get "host"
>>> and "port" parsed out into two variables. Why can't the caller
>>> expect to see the same happen when feeding "[host:port]" to the
>>> function?
>>>
>> This is somewhat out of my head:
>> git clone git://[example.com:123]:/test #illegal, malformated URL
>> git clone [example.com:123]:/test #scp-like URL, legacy
>> git clone ssh://[example.com:123]:/test #illegal, but supported as
>> legacy, because
>> git clone ssh://[user@::1]/test # was the only way to
>> specify a user name at a literal IPv6 address
>>
>> May be we should have some more test cases for malformated git:// URLs?
>
> None of these malformed urls are rejected with or without my series
> applied:
>
> Without:
> $ git fetch-pack --diag-url git://[example.com:123]:/test
> Diag: url=git://[example.com:123]:/test
> Diag: protocol=git
> Diag: hostandport=[example.com:123]:
> Diag: path=/test
> $ git fetch-pack --diag-url
> ssh://[example.com:123]:/test
> Diag: url=ssh://[example.com:123]:/test
> Diag: protocol=ssh
> Diag: userandhost=example.com
> Diag: port=123
> Diag: path=/test
>
> With:
> $ git fetch-pack --diag-url git://[example.com:123]:/test
> Diag: url=git://[example.com:123]:/test
> Diag: protocol=git
> Diag: user=NULL
> Diag: host=example.com
> Diag: port=123
> Diag: path=/test
> $ git fetch-pack --diag-url ssh://[example.com:123]:/test
> Diag: url=ssh://[example.com:123]:/test
> Diag: protocol=ssh
> Diag: user=NULL
> Diag: host=example.com
> Diag: port=123
> Diag: path=/test
>
> Note in the first case, hostandport is "[example.com:123]:", and that
> is treated as host=example.com:123 and port=NULL further down, so my
> series is changing something here, but arguably, it makes it less worse.
> (note that both with and without my series,
> "git://[example.com:123]:42/path" is treated the same, so only a corner
> case changed)
>
> Can we go forward with the current series (modulo the comment style
> thing Eric noted, and maybe adding a note about the parser handling urls
> as per urls.txt), and not bloat scope it? If anything, the state of the
> code after the series should make further parser changes easier.
>
> Cheers,
>
> Mike
>

Thanks for digging.

How about something like this:

/*
* get_host_and_port may not return a port in the [host:port]:path case.
* To support this undocumented legacy we still need to split the port.
*/