[PATCH 1 of 6 stable] convert: bail out in Subversion source if encountering non-ASCII HTTP(S) URL

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH 1 of 6 stable] convert: bail out in Subversion source if encountering non-ASCII HTTP(S) URL

Manuel Jacob
# HG changeset patch
# User Manuel Jacob <[hidden email]>
# Date 1593485752 -7200
#      Tue Jun 30 04:55:52 2020 +0200
# Branch stable
# Node ID 2d738426309a0804a44c2024b97cef11aece6dde
# Parent  bd0f122f3f51939b2adc7d6f9d8611692b1ebdce
# EXP-Topic svn_encoding
convert: bail out in Subversion source if encountering non-ASCII HTTP(S) URL

Before this patch, in the tested case, urllib raised `httplib.InvalidURL: URL
can't contain control characters. '/\xff/!svn/ver/0/.svn' (found at least
'\xff')`, which resulted in that the URL was never recognized as a Subversion
repository.

This patch adds a check that bails out if the URL contains non-ASCII characters.
The warning is not overly user-friendly, but giving the user something to type
into a search engine is definitively better than not explaining why the
repository was not recognized.

We could support non-ASCII chracters by quoting them before passing them to
urllib. However, we would want to be compatible with what the `svn` command
does, which converts the URL from the locale encoding to UTF-8, percent-encodes
it and sends it to the server. If the locale encoding is not UTF-8, the
behavior is IMHO not very intuitive, as the `svn` command may send different
(percent-encoded) octets than what was passed on the console. Instead of
copying this behavior, we better leave it forbidden.

diff --git a/hgext/convert/subversion.py b/hgext/convert/subversion.py
--- a/hgext/convert/subversion.py
+++ b/hgext/convert/subversion.py
@@ -347,6 +347,15 @@
         path = os.path.abspath(url)
     if proto == b'file':
         path = util.pconvert(path)
+    elif proto in (b'http', 'https'):
+        if not encoding.isasciistr(path):
+            ui.warn(
+                _(
+                    b"Subversion sources don't support non-ASCII characters in "
+                    b"HTTP(S) URLs. Please percent-encode them.\n"
+                )
+            )
+            return False
     check = protomap.get(proto, lambda *args: False)
     while b'/' in path:
         if check(ui, path, proto):
diff --git a/tests/test-convert-svn-encoding.t b/tests/test-convert-svn-encoding.t
--- a/tests/test-convert-svn-encoding.t
+++ b/tests/test-convert-svn-encoding.t
@@ -153,6 +153,16 @@
 
   $ cd ..
 
+Subversion sources don't support non-ASCII characters in HTTP(S) URLs.
+
+  $ XFF=$($PYTHON -c 'from mercurial.utils.procutil import stdout; stdout.write(b"\xff")')
+  $ hg convert --source-type=svn <a href="http://localhost:$HGPORT/$XFF">http://localhost:$HGPORT/$XFF test
+  initializing destination test repository
+  Subversion sources don't support non-ASCII characters in HTTP(S) URLs. Please percent-encode them.
+  <a href="http://localhost:$HGPORT/">http://localhost:$HGPORT/\xff does not look like a Subversion repository (esc)
+  abort: <a href="http://localhost:$HGPORT/">http://localhost:$HGPORT/\xff: missing or unsupported repository (esc)
+  [255]
+
 #if py3
 For now, on Python 3, we abort when encountering non-UTF-8 percent-encoded
 bytes in a filename.

_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

[PATCH 2 of 6 stable] py3: pass URL as str

Manuel Jacob
# HG changeset patch
# User Manuel Jacob <[hidden email]>
# Date 1593486276 -7200
#      Tue Jun 30 05:04:36 2020 +0200
# Branch stable
# Node ID 4d00ac33053273ed2b9a6431800d59df94adcfc3
# Parent  2d738426309a0804a44c2024b97cef11aece6dde
# EXP-Topic svn_encoding
py3: pass URL as str

Before the patch, HTTP(S) URLs were never recognized as a Subversion repository
on Python 3.

diff --git a/hgext/convert/subversion.py b/hgext/convert/subversion.py
--- a/hgext/convert/subversion.py
+++ b/hgext/convert/subversion.py
@@ -284,7 +284,9 @@
 def httpcheck(ui, path, proto):
     try:
         opener = urlreq.buildopener()
-        rsp = opener.open(b'%s://%s/!svn/ver/0/.svn' % (proto, path), b'rb')
+        rsp = opener.open(
+            pycompat.strurl(b'%s://%s/!svn/ver/0/.svn' % (proto, path)), b'rb'
+        )
         data = rsp.read()
     except urlerr.httperror as inst:
         if inst.code != 404:

_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

[PATCH 3 of 6 stable] convert: correctly convert paths to UTF-8 for Subversion

Manuel Jacob
In reply to this post by Manuel Jacob
# HG changeset patch
# User Manuel Jacob <[hidden email]>
# Date 1593435816 -7200
#      Mon Jun 29 15:03:36 2020 +0200
# Branch stable
# Node ID 7bd48930ea77337213859f562e2fc0abd6734830
# Parent  4d00ac33053273ed2b9a6431800d59df94adcfc3
# EXP-Topic svn_encoding
convert: correctly convert paths to UTF-8 for Subversion

The previous code using encoding.tolocal() only worked by chance in these
situations:

* The string is ASCII: The fast path was triggered and the string was returned
  unmodified.
* The local encoding is UTF-8: The source and target encoding is the same.
* The string is not valid UTF-8 and the native encoding is ISO-8859-1: If the
  string doesn’t decode using UTF-8, ISO-8859-1 is tried as a fallback. During
  `hg convert`, the local encoding is always UTF-8. The irony is that in this
  case, encoding.tolocal() behaves like what someone would expect the reverse
  function, encoding.fromlocal(), to do.

When the locale encoding is ISO-8859-15, trying to convert a SVN repo `/tmp/a€`
failed before like this:

file:///tmp/a%C2%A4 does not look like a Subversion repository to libsvn version 1.14.0

The correct URL is `file:///tmp/a%E2%82%AC`.

Unlike previously (with the ISO-8859-1 fallback), decoding the path using the
locale encoding can fail. In this case, we have to bail out, as Subversion
won’t be able to do anything useful with the path.

diff --git a/hgext/convert/subversion.py b/hgext/convert/subversion.py
--- a/hgext/convert/subversion.py
+++ b/hgext/convert/subversion.py
@@ -3,6 +3,8 @@
 # Copyright(C) 2007 Daniel Holth et al
 from __future__ import absolute_import
 
+import codecs
+import locale
 import os
 import re
 import xml.dom.minidom
@@ -63,6 +65,38 @@
     svn = None
 
 
+# In Subversion, paths are Unicode (encoded as UTF-8), which Subversion
+# converts from / to native strings when interfacing with the OS. When passing
+# paths to Subversion, we have to recode them such that it roundstrips with
+# what Subversion is doing.
+
+fsencoding = None
+
+
+def init_fsencoding():
+    global fsencoding, fsencoding_is_utf8
+    if fsencoding is not None:
+        return
+    if pycompat.iswindows:
+        # On Windows, filenames are Unicode, but we store them using the MBCS
+        # encoding.
+        fsencoding = 'mbcs'
+    else:
+        # This is the encoding used to convert UTF-8 back to natively-encoded
+        # strings in Subversion 1.14.0 or earlier with APR 1.7.0 or earlier.
+        with util.with_lc_ctype():
+            fsencoding = locale.nl_langinfo(locale.CODESET) or 'ISO-8859-1'
+    fsencoding = codecs.lookup(fsencoding).name
+    fsencoding_is_utf8 = fsencoding == codecs.lookup('utf-8').name
+
+
+def fs2svn(s):
+    if fsencoding_is_utf8:
+        return s
+    else:
+        return s.decode(fsencoding).encode('utf-8')
+
+
 class SvnPathNotFound(Exception):
     pass
 
@@ -117,7 +151,7 @@
             path = b'/' + util.normpath(path)
         # Module URL is later compared with the repository URL returned
         # by svn API, which is UTF-8.
-        path = encoding.tolocal(path)
+        path = fs2svn(path)
         path = b'file://%s' % quote(path)
     return svn.core.svn_path_canonicalize(path)
 
@@ -347,6 +381,17 @@
     except ValueError:
         proto = b'file'
         path = os.path.abspath(url)
+        try:
+            path.decode(fsencoding)
+        except UnicodeDecodeError:
+            ui.warn(
+                _(
+                    b'Subversion requires that paths can be converted to '
+                    b'Unicode using the current locale encoding (%s)\n'
+                )
+                % pycompat.sysbytes(fsencoding)
+            )
+            return False
     if proto == b'file':
         path = util.pconvert(path)
     elif proto in (b'http', 'https'):
@@ -384,6 +429,7 @@
     def __init__(self, ui, repotype, url, revs=None):
         super(svn_source, self).__init__(ui, repotype, url, revs=revs)
 
+        init_fsencoding()
         if not (
             url.startswith(b'svn://')
             or url.startswith(b'svn+ssh://')
diff --git a/tests/test-convert-svn-encoding.t b/tests/test-convert-svn-encoding.t
--- a/tests/test-convert-svn-encoding.t
+++ b/tests/test-convert-svn-encoding.t
@@ -163,6 +163,26 @@
   abort: <a href="http://localhost:$HGPORT/">http://localhost:$HGPORT/\xff: missing or unsupported repository (esc)
   [255]
 
+In Subversion, paths are Unicode (encoded as UTF-8). Therefore paths that can't
+be converted between UTF-8 and the locale encoding (which is always ASCII in
+tests) don't work.
+
+  $ cp -R svn-repo $XFF
+  $ hg convert $XFF test
+  initializing destination test repository
+  Subversion requires that paths can be converted to Unicode using the current locale encoding (ascii)
+  \xff does not look like a CVS checkout (glob) (esc)
+  $TESTTMP/\xff does not look like a Git repository (esc)
+  \xff does not look like a Subversion repository (glob) (esc)
+  \xff is not a local Mercurial repository (glob) (esc)
+  \xff does not look like a darcs repository (glob) (esc)
+  \xff does not look like a monotone repository (glob) (esc)
+  \xff does not look like a GNU Arch repository (glob) (esc)
+  \xff does not look like a Bazaar repository (glob) (esc)
+  cannot find required "p4" tool
+  abort: \xff: missing or unsupported repository (glob) (esc)
+  [255]
+
 #if py3
 For now, on Python 3, we abort when encountering non-UTF-8 percent-encoded
 bytes in a filename.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

[PATCH 4 of 6 stable] convert: convert URLs to UTF-8 for Subversion

Manuel Jacob
In reply to this post by Manuel Jacob
# HG changeset patch
# User Manuel Jacob <[hidden email]>
# Date 1593487847 -7200
#      Tue Jun 30 05:30:47 2020 +0200
# Branch stable
# Node ID 388217b58dc5aa2e4b34b8fab67c17016abcc005
# Parent  7bd48930ea77337213859f562e2fc0abd6734830
# EXP-Topic svn_encoding
convert: convert URLs to UTF-8 for Subversion

Preamble: for comprehension, note that the `path` of geturl() would better be
called `path_or_url` (the argument of the call of getsvn() is called `url`).

For HTTP(S) URLs, the changes don’t make a difference, as they are restricted to
ASCII.

For file URLs, the reasoning is the same as for paths: we have to roundtrip with
what Subversion is doing.

When the locale encoding is ISO-8859-15, trying to convert a SVN repo
`file:///tmp/a€` failed before like this:

file:///tmp/a%A4 does not look like a Subversion repository to libsvn version 1.14.0

Decoding the path using the locale encoding can fail. In this case, we have to
bail out, as Subversion won’t be able to do anything useful with the path.

diff --git a/hgext/convert/subversion.py b/hgext/convert/subversion.py
--- a/hgext/convert/subversion.py
+++ b/hgext/convert/subversion.py
@@ -65,10 +65,10 @@
     svn = None
 
 
-# In Subversion, paths are Unicode (encoded as UTF-8), which Subversion
-# converts from / to native strings when interfacing with the OS. When passing
-# paths to Subversion, we have to recode them such that it roundstrips with
-# what Subversion is doing.
+# In Subversion, paths and URLs are Unicode (encoded as UTF-8), which
+# Subversion converts from / to native strings when interfacing with the OS.
+# When passing paths and URLs to Subversion, we have to recode them such that
+# it roundstrips with what Subversion is doing.
 
 fsencoding = None
 
@@ -141,7 +141,9 @@
 
 def geturl(path):
     try:
-        return svn.client.url_from_path(svn.core.svn_path_canonicalize(path))
+        return svn.client.url_from_path(
+            svn.core.svn_path_canonicalize(fs2svn(path))
+        )
     except svn.core.SubversionException:
         # svn.client.url_from_path() fails with local repositories
         pass
@@ -358,6 +360,19 @@
                 and path[2:6].lower() == b'%3a/'
             ):
                 path = path[:2] + b':/' + path[6:]
+            try:
+                path.decode(fsencoding)
+            except UnicodeDecodeError:
+                ui.warn(
+                    _(
+                        b'Subversion requires that file URLs can be converted '
+                        b'to Unicode using the current locale encoding (%s)\n'
+                    )
+                    % pycompat.sysbytes(fsencoding)
+                )
+                return False
+            # FIXME: The following reasoning and logic is wrong and will be
+            # fixed in a following changeset.
             # pycompat.fsdecode() / pycompat.fsencode() are used so that bytes
             # in the URL roundtrip correctly on Unix. urlreq.url2pathname() on
             # py3 will decode percent-encoded bytes using the utf-8 encoding
diff --git a/tests/test-convert-svn-encoding.t b/tests/test-convert-svn-encoding.t
--- a/tests/test-convert-svn-encoding.t
+++ b/tests/test-convert-svn-encoding.t
@@ -182,6 +182,20 @@
   cannot find required "p4" tool
   abort: \xff: missing or unsupported repository (glob) (esc)
   [255]
+  $ hg convert file://$TESTTMP/$XFF test
+  initializing destination test repository
+  Subversion requires that file URLs can be converted to Unicode using the current locale encoding (ascii)
+  file:/*/$TESTTMP/\xff does not look like a CVS checkout (glob) (esc)
+  $TESTTMP/file:$TESTTMP/\xff does not look like a Git repository (esc)
+  file:/*/$TESTTMP/\xff does not look like a Subversion repository (glob) (esc)
+  file:/*/$TESTTMP/\xff is not a local Mercurial repository (glob) (esc)
+  file:/*/$TESTTMP/\xff does not look like a darcs repository (glob) (esc)
+  file:/*/$TESTTMP/\xff does not look like a monotone repository (glob) (esc)
+  file:/*/$TESTTMP/\xff does not look like a GNU Arch repository (glob) (esc)
+  file:/*/$TESTTMP/\xff does not look like a Bazaar repository (glob) (esc)
+  file:/*/$TESTTMP/\xff does not look like a P4 repository (glob) (esc)
+  abort: file:/*/$TESTTMP/\xff: missing or unsupported repository (glob) (esc)
+  [255]
 
 #if py3
 For now, on Python 3, we abort when encountering non-UTF-8 percent-encoded
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

[PATCH 5 of 6 stable] tests: use path inside test dir

Manuel Jacob
In reply to this post by Manuel Jacob
# HG changeset patch
# User Manuel Jacob <[hidden email]>
# Date 1593473537 -7200
#      Tue Jun 30 01:32:17 2020 +0200
# Branch stable
# Node ID e1a4c7f23e804f37c3848fc408607af916d619d1
# Parent  388217b58dc5aa2e4b34b8fab67c17016abcc005
# EXP-Topic svn_encoding
tests: use path inside test dir

This will make the diff for the next patch less noisy.

diff --git a/tests/test-convert-svn-encoding.t b/tests/test-convert-svn-encoding.t
--- a/tests/test-convert-svn-encoding.t
+++ b/tests/test-convert-svn-encoding.t
@@ -201,18 +201,18 @@
 For now, on Python 3, we abort when encountering non-UTF-8 percent-encoded
 bytes in a filename.
 
-  $ hg convert file:///%ff test
+  $ hg convert file://$TESTTMP/%FF test
   initializing destination test repository
   on Python 3, we currently do not support non-UTF-8 percent-encoded bytes in file URLs for Subversion repositories
-  file:///%ff does not look like a CVS checkout
-  $TESTTMP/file:/%ff does not look like a Git repository
-  file:///%ff does not look like a Subversion repository
-  file:///%ff is not a local Mercurial repository
-  file:///%ff does not look like a darcs repository
-  file:///%ff does not look like a monotone repository
-  file:///%ff does not look like a GNU Arch repository
-  file:///%ff does not look like a Bazaar repository
-  file:///%ff does not look like a P4 repository
-  abort: file:///%ff: missing or unsupported repository
+  file:/*/$TESTTMP/%FF does not look like a CVS checkout (glob)
+  $TESTTMP/file:$TESTTMP/%FF does not look like a Git repository
+  file:/*/$TESTTMP/%FF does not look like a Subversion repository (glob)
+  file:/*/$TESTTMP/%FF is not a local Mercurial repository (glob)
+  file:/*/$TESTTMP/%FF does not look like a darcs repository (glob)
+  file:/*/$TESTTMP/%FF does not look like a monotone repository (glob)
+  file:/*/$TESTTMP/%FF does not look like a GNU Arch repository (glob)
+  file:/*/$TESTTMP/%FF does not look like a Bazaar repository (glob)
+  file:/*/$TESTTMP/%FF does not look like a P4 repository (glob)
+  abort: file:/*/$TESTTMP/%FF: missing or unsupported repository (glob)
   [255]
 #endif

_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

[PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Manuel Jacob
In reply to this post by Manuel Jacob
# HG changeset patch
# User Manuel Jacob <[hidden email]>
# Date 1593494609 -7200
#      Tue Jun 30 07:23:29 2020 +0200
# Branch stable
# Node ID 9915fdff8d1732ce62b6df69b50106384d4ad4d1
# Parent  e1a4c7f23e804f37c3848fc408607af916d619d1
# EXP-Topic svn_encoding
convert: handle percent-encoded bytes in file URLs like Subversion

75b59d221aa3 added most of the code that gets removed by this patch. It helped
making progress on Python 3, but the reasoning was wrong in many ways. I tried
to retract it while it was queued, but it was too late.

Back then, I was asssuming that what happened on Python 2 (preserving bytes) is
correct and my Python 3 change is a hack. However it turned out that Subversion
interprets percent-encoded bytes as UTF-8. Accepting the same format as
Subversion is a good idea.

Consistency with urlreq.pathname2url() (as described in the removed comment)
doesn’t matter because that function is only used for passing paths to urllib.

This is not a backwards-incompatible change because before 5c0d5b48e58c,
non-ASCII filenames didn’t work at all on Python 2.

When the locale encoding is ISO-8859-15, `svn` accepts `file:///tmp/a%E2%82%AC`
for `/tmp/a€`. Before this patch, this was the case for this extension on
Python 3, but not on Python 2. This patch makes it work like with `svn` on both
Python 2 and Python 3.

diff --git a/hgext/convert/subversion.py b/hgext/convert/subversion.py
--- a/hgext/convert/subversion.py
+++ b/hgext/convert/subversion.py
@@ -349,6 +349,32 @@
 }
 
 
+class NonUtf8PercentEncodedBytes(Exception):
+    pass
+
+
+# Subversion paths are Unicode. Since the percent-decoding is done on
+# UTF-8-encoded strings, percent-encoded bytes are interpreted as UTF-8.
+def url2pathname_like_subversion(unicodepath):
+    if pycompat.ispy3:
+        # On Python 3, we have to pass unicode to urlreq.url2pathname().
+        # Percent-decoded bytes get decoded using UTF-8 and the 'replace' error
+        # handler.
+        unicodepath = urlreq.url2pathname(unicodepath)
+        if u'\N{REPLACEMENT CHARACTER}' in unicodepath:
+            raise NonUtf8PercentEncodedBytes
+        else:
+            return unicodepath
+    else:
+        # If we passed unicode on Python 2, it would be converted using the
+        # latin-1 encoding. Therefore, we pass UTF-8-encoded bytes.
+        unicodepath = urlreq.url2pathname(unicodepath.encode('utf-8'))
+        try:
+            return unicodepath.decode('utf-8')
+        except UnicodeDecodeError:
+            raise NonUtf8PercentEncodedBytes
+
+
 def issvnurl(ui, url):
     try:
         proto, path = url.split(b'://', 1)
@@ -361,7 +387,7 @@
             ):
                 path = path[:2] + b':/' + path[6:]
             try:
-                path.decode(fsencoding)
+                unicodepath = path.decode(fsencoding)
             except UnicodeDecodeError:
                 ui.warn(
                     _(
@@ -371,28 +397,17 @@
                     % pycompat.sysbytes(fsencoding)
                 )
                 return False
-            # FIXME: The following reasoning and logic is wrong and will be
-            # fixed in a following changeset.
-            # pycompat.fsdecode() / pycompat.fsencode() are used so that bytes
-            # in the URL roundtrip correctly on Unix. urlreq.url2pathname() on
-            # py3 will decode percent-encoded bytes using the utf-8 encoding
-            # and the "replace" error handler. This means that it will not
-            # preserve non-UTF-8 bytes (https://bugs.python.org/issue40983).
-            # url.open() uses the reverse function (urlreq.pathname2url()) and
-            # has a similar problem
-            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357). It makes
-            # sense to solve both problems together and handle all file URLs
-            # consistently. For now, we warn.
-            unicodepath = urlreq.url2pathname(pycompat.fsdecode(path))
-            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in unicodepath:
+            try:
+                unicodepath = url2pathname_like_subversion(unicodepath)
+            except NonUtf8PercentEncodedBytes:
                 ui.warn(
                     _(
-                        b'on Python 3, we currently do not support non-UTF-8 '
-                        b'percent-encoded bytes in file URLs for Subversion '
-                        b'repositories\n'
+                        b'Subversion does not support non-UTF-8 '
+                        b'percent-encoded bytes in file URLs\n'
                     )
                 )
-            path = pycompat.fsencode(unicodepath)
+                return False
+            path = unicodepath.encode(fsencoding)
     except ValueError:
         proto = b'file'
         path = os.path.abspath(url)
diff --git a/tests/test-convert-svn-encoding.t b/tests/test-convert-svn-encoding.t
--- a/tests/test-convert-svn-encoding.t
+++ b/tests/test-convert-svn-encoding.t
@@ -197,13 +197,13 @@
   abort: file:/*/$TESTTMP/\xff: missing or unsupported repository (glob) (esc)
   [255]
 
-#if py3
-For now, on Python 3, we abort when encountering non-UTF-8 percent-encoded
-bytes in a filename.
+Subversion decodes percent-encoded bytes on the converted, UTF-8-encoded
+string. Therefore, if the percent-encoded bytes aren't valid UTF-8, Subversion
+would choke on them when converting them to the locale encoding.
 
   $ hg convert file://$TESTTMP/%FF test
   initializing destination test repository
-  on Python 3, we currently do not support non-UTF-8 percent-encoded bytes in file URLs for Subversion repositories
+  Subversion does not support non-UTF-8 percent-encoded bytes in file URLs
   file:/*/$TESTTMP/%FF does not look like a CVS checkout (glob)
   $TESTTMP/file:$TESTTMP/%FF does not look like a Git repository
   file:/*/$TESTTMP/%FF does not look like a Subversion repository (glob)
@@ -215,4 +215,3 @@
   file:/*/$TESTTMP/%FF does not look like a P4 repository (glob)
   abort: file:/*/$TESTTMP/%FF: missing or unsupported repository (glob)
   [255]
-#endif
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3 of 6 stable] convert: correctly convert paths to UTF-8 for Subversion

Yuya Nishihara
In reply to this post by Manuel Jacob
On Tue, 30 Jun 2020 08:45:44 +0200, Manuel Jacob wrote:
> # HG changeset patch
> # User Manuel Jacob <[hidden email]>
> # Date 1593435816 -7200
> #      Mon Jun 29 15:03:36 2020 +0200
> # Branch stable
> # Node ID 7bd48930ea77337213859f562e2fc0abd6734830
> # Parent  4d00ac33053273ed2b9a6431800d59df94adcfc3
> # EXP-Topic svn_encoding
> convert: correctly convert paths to UTF-8 for Subversion

> @@ -117,7 +151,7 @@
>              path = b'/' + util.normpath(path)
>          # Module URL is later compared with the repository URL returned
>          # by svn API, which is UTF-8.
> -        path = encoding.tolocal(path)
> +        path = fs2svn(path)
>          path = b'file://%s' % quote(path)
>      return svn.core.svn_path_canonicalize(path)

It's better to document that geturl() may raise UnicodeDecodeError.

> --- a/tests/test-convert-svn-encoding.t
> +++ b/tests/test-convert-svn-encoding.t
> @@ -163,6 +163,26 @@
>    abort: <a href="http://localhost:$HGPORT/">http://localhost:$HGPORT/\xff: missing or unsupported repository (esc)
>    [255]
>  
> +In Subversion, paths are Unicode (encoded as UTF-8). Therefore paths that can't
> +be converted between UTF-8 and the locale encoding (which is always ASCII in
> +tests) don't work.
> +
> +  $ cp -R svn-repo $XFF

I suspect this would fail on Windows depending on system locale.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Yuya Nishihara
In reply to this post by Manuel Jacob
On Tue, 30 Jun 2020 08:45:47 +0200, Manuel Jacob wrote:
> # HG changeset patch
> # User Manuel Jacob <[hidden email]>
> # Date 1593494609 -7200
> #      Tue Jun 30 07:23:29 2020 +0200
> # Branch stable
> # Node ID 9915fdff8d1732ce62b6df69b50106384d4ad4d1
> # Parent  e1a4c7f23e804f37c3848fc408607af916d619d1
> # EXP-Topic svn_encoding
> convert: handle percent-encoded bytes in file URLs like Subversion

>  def issvnurl(ui, url):
>      try:
>          proto, path = url.split(b'://', 1)
> @@ -361,7 +387,7 @@
>              ):
>                  path = path[:2] + b':/' + path[6:]
>              try:
> -                path.decode(fsencoding)
> +                unicodepath = path.decode(fsencoding)
>              except UnicodeDecodeError:
>                  ui.warn(
>                      _(
> @@ -371,28 +397,17 @@
>                      % pycompat.sysbytes(fsencoding)
>                  )
>                  return False
> -            # FIXME: The following reasoning and logic is wrong and will be
> -            # fixed in a following changeset.
> -            # pycompat.fsdecode() / pycompat.fsencode() are used so that bytes
> -            # in the URL roundtrip correctly on Unix. urlreq.url2pathname() on
> -            # py3 will decode percent-encoded bytes using the utf-8 encoding
> -            # and the "replace" error handler. This means that it will not
> -            # preserve non-UTF-8 bytes (https://bugs.python.org/issue40983).
> -            # url.open() uses the reverse function (urlreq.pathname2url()) and
> -            # has a similar problem
> -            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357). It makes
> -            # sense to solve both problems together and handle all file URLs
> -            # consistently. For now, we warn.
> -            unicodepath = urlreq.url2pathname(pycompat.fsdecode(path))
> -            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in unicodepath:
> +            try:
> +                unicodepath = url2pathname_like_subversion(unicodepath)
> +            except NonUtf8PercentEncodedBytes:
>                  ui.warn(
>                      _(
> -                        b'on Python 3, we currently do not support non-UTF-8 '
> -                        b'percent-encoded bytes in file URLs for Subversion '
> -                        b'repositories\n'
> +                        b'Subversion does not support non-UTF-8 '
> +                        b'percent-encoded bytes in file URLs\n'
>                      )
>                  )
> -            path = pycompat.fsencode(unicodepath)
> +                return False
> +            path = unicodepath.encode(fsencoding)

I think pycompat.fsencode() is more correct since the path will be later
tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
encoding.unitolocal() is okay.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 1 of 6 stable] convert: bail out in Subversion source if encountering non-ASCII HTTP(S) URL

Yuya Nishihara
In reply to this post by Manuel Jacob
On Tue, 30 Jun 2020 08:45:42 +0200, Manuel Jacob wrote:
> # HG changeset patch
> # User Manuel Jacob <[hidden email]>
> # Date 1593485752 -7200
> #      Tue Jun 30 04:55:52 2020 +0200
> # Branch stable
> # Node ID 2d738426309a0804a44c2024b97cef11aece6dde
> # Parent  bd0f122f3f51939b2adc7d6f9d8611692b1ebdce
> # EXP-Topic svn_encoding
> convert: bail out in Subversion source if encountering non-ASCII HTTP(S) URL

Queued 1-5, thanks.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Manuel Jacob
In reply to this post by Yuya Nishihara
On 2020-06-30 14:24, Yuya Nishihara wrote:

> On Tue, 30 Jun 2020 08:45:47 +0200, Manuel Jacob wrote:
>> # HG changeset patch
>> # User Manuel Jacob <[hidden email]>
>> # Date 1593494609 -7200
>> #      Tue Jun 30 07:23:29 2020 +0200
>> # Branch stable
>> # Node ID 9915fdff8d1732ce62b6df69b50106384d4ad4d1
>> # Parent  e1a4c7f23e804f37c3848fc408607af916d619d1
>> # EXP-Topic svn_encoding
>> convert: handle percent-encoded bytes in file URLs like Subversion
>
>>  def issvnurl(ui, url):
>>      try:
>>          proto, path = url.split(b'://', 1)
>> @@ -361,7 +387,7 @@
>>              ):
>>                  path = path[:2] + b':/' + path[6:]
>>              try:
>> -                path.decode(fsencoding)
>> +                unicodepath = path.decode(fsencoding)
>>              except UnicodeDecodeError:
>>                  ui.warn(
>>                      _(
>> @@ -371,28 +397,17 @@
>>                      % pycompat.sysbytes(fsencoding)
>>                  )
>>                  return False
>> -            # FIXME: The following reasoning and logic is wrong and
>> will be
>> -            # fixed in a following changeset.
>> -            # pycompat.fsdecode() / pycompat.fsencode() are used so
>> that bytes
>> -            # in the URL roundtrip correctly on Unix.
>> urlreq.url2pathname() on
>> -            # py3 will decode percent-encoded bytes using the utf-8
>> encoding
>> -            # and the "replace" error handler. This means that it
>> will not
>> -            # preserve non-UTF-8 bytes
>> (https://bugs.python.org/issue40983).
>> -            # url.open() uses the reverse function
>> (urlreq.pathname2url()) and
>> -            # has a similar problem
>> -            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357). It
>> makes
>> -            # sense to solve both problems together and handle all
>> file URLs
>> -            # consistently. For now, we warn.
>> -            unicodepath =
>> urlreq.url2pathname(pycompat.fsdecode(path))
>> -            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in
>> unicodepath:
>> +            try:
>> +                unicodepath =
>> url2pathname_like_subversion(unicodepath)
>> +            except NonUtf8PercentEncodedBytes:
>>                  ui.warn(
>>                      _(
>> -                        b'on Python 3, we currently do not support
>> non-UTF-8 '
>> -                        b'percent-encoded bytes in file URLs for
>> Subversion '
>> -                        b'repositories\n'
>> +                        b'Subversion does not support non-UTF-8 '
>> +                        b'percent-encoded bytes in file URLs\n'
>>                      )
>>                  )
>> -            path = pycompat.fsencode(unicodepath)
>> +                return False
>> +            path = unicodepath.encode(fsencoding)
>
> I think pycompat.fsencode() is more correct since the path will be
> later
> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> encoding.unitolocal() is okay.

It was deliberate to use something that fails when it can’t be encoded.
However, I thought about failing in case of a logic error, not about
that the URL could contain a (UTF-8-encoded) code point that is not
available in `fsencoding`.

On Windows, Subversion will be able to handle all code points. We’re
unnecessarily restricting ourselves by doing the rest of the function on
bytes. Although not entirely Mercurial-idiomatic, we should maybe
consider doing the rest of the function in Unicode. We can also bail
out, but we have to make sure the warning mentions that it’s not the
user’s fault.

On Unix, Subversion will try later to convert the path to the locale
encoding, which fails if the code point is not encodable. We should bail
out in this case.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Manuel Jacob
On 2020-06-30 15:25, Manuel Jacob wrote:

> On 2020-06-30 14:24, Yuya Nishihara wrote:
>> On Tue, 30 Jun 2020 08:45:47 +0200, Manuel Jacob wrote:
>>> # HG changeset patch
>>> # User Manuel Jacob <[hidden email]>
>>> # Date 1593494609 -7200
>>> #      Tue Jun 30 07:23:29 2020 +0200
>>> # Branch stable
>>> # Node ID 9915fdff8d1732ce62b6df69b50106384d4ad4d1
>>> # Parent  e1a4c7f23e804f37c3848fc408607af916d619d1
>>> # EXP-Topic svn_encoding
>>> convert: handle percent-encoded bytes in file URLs like Subversion
>>
>>>  def issvnurl(ui, url):
>>>      try:
>>>          proto, path = url.split(b'://', 1)
>>> @@ -361,7 +387,7 @@
>>>              ):
>>>                  path = path[:2] + b':/' + path[6:]
>>>              try:
>>> -                path.decode(fsencoding)
>>> +                unicodepath = path.decode(fsencoding)
>>>              except UnicodeDecodeError:
>>>                  ui.warn(
>>>                      _(
>>> @@ -371,28 +397,17 @@
>>>                      % pycompat.sysbytes(fsencoding)
>>>                  )
>>>                  return False
>>> -            # FIXME: The following reasoning and logic is wrong and
>>> will be
>>> -            # fixed in a following changeset.
>>> -            # pycompat.fsdecode() / pycompat.fsencode() are used so
>>> that bytes
>>> -            # in the URL roundtrip correctly on Unix.
>>> urlreq.url2pathname() on
>>> -            # py3 will decode percent-encoded bytes using the utf-8
>>> encoding
>>> -            # and the "replace" error handler. This means that it
>>> will not
>>> -            # preserve non-UTF-8 bytes
>>> (https://bugs.python.org/issue40983).
>>> -            # url.open() uses the reverse function
>>> (urlreq.pathname2url()) and
>>> -            # has a similar problem
>>> -            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357).
>>> It makes
>>> -            # sense to solve both problems together and handle all
>>> file URLs
>>> -            # consistently. For now, we warn.
>>> -            unicodepath =
>>> urlreq.url2pathname(pycompat.fsdecode(path))
>>> -            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in
>>> unicodepath:
>>> +            try:
>>> +                unicodepath =
>>> url2pathname_like_subversion(unicodepath)
>>> +            except NonUtf8PercentEncodedBytes:
>>>                  ui.warn(
>>>                      _(
>>> -                        b'on Python 3, we currently do not support
>>> non-UTF-8 '
>>> -                        b'percent-encoded bytes in file URLs for
>>> Subversion '
>>> -                        b'repositories\n'
>>> +                        b'Subversion does not support non-UTF-8 '
>>> +                        b'percent-encoded bytes in file URLs\n'
>>>                      )
>>>                  )
>>> -            path = pycompat.fsencode(unicodepath)
>>> +                return False
>>> +            path = unicodepath.encode(fsencoding)
>>
>> I think pycompat.fsencode() is more correct since the path will be
>> later
>> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
>> encoding.unitolocal() is okay.
>
> It was deliberate to use something that fails when it can’t be
> encoded. However, I thought about failing in case of a logic error,
> not about that the URL could contain a (UTF-8-encoded) code point that
> is not available in `fsencoding`.
>
> On Windows, Subversion will be able to handle all code points. We’re
> unnecessarily restricting ourselves by doing the rest of the function
> on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> consider doing the rest of the function in Unicode. We can also bail
> out, but we have to make sure the warning mentions that it’s not the
> user’s fault.
>
> On Unix, Subversion will try later to convert the path to the locale
> encoding, which fails if the code point is not encodable. We should
> bail out in this case.

Let me add that with "bail out", I included that a warning describing
the problem is shown.

Since what I described is mostly about showing nicer messages, should
they be added in an amended patch or as follow-ups?
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3 of 6 stable] convert: correctly convert paths to UTF-8 for Subversion

Manuel Jacob
In reply to this post by Yuya Nishihara
On 2020-06-30 14:02, Yuya Nishihara wrote:

> On Tue, 30 Jun 2020 08:45:44 +0200, Manuel Jacob wrote:
>> # HG changeset patch
>> # User Manuel Jacob <[hidden email]>
>> # Date 1593435816 -7200
>> #      Mon Jun 29 15:03:36 2020 +0200
>> # Branch stable
>> # Node ID 7bd48930ea77337213859f562e2fc0abd6734830
>> # Parent  4d00ac33053273ed2b9a6431800d59df94adcfc3
>> # EXP-Topic svn_encoding
>> convert: correctly convert paths to UTF-8 for Subversion
>
>> @@ -117,7 +151,7 @@
>>              path = b'/' + util.normpath(path)
>>          # Module URL is later compared with the repository URL
>> returned
>>          # by svn API, which is UTF-8.
>> -        path = encoding.tolocal(path)
>> +        path = fs2svn(path)
>>          path = b'file://%s' % quote(path)
>>      return svn.core.svn_path_canonicalize(path)
>
> It's better to document that geturl() may raise UnicodeDecodeError.

I think we should handle all cases where it could fail and bail out with
a useful warning.

* This is ideally the case for everything handled by issvnurl() (see my
replies on patch 6).
* In the case `os.path.exists(url) and os.path.exists(os.path.join(url,
b'.svn'))`,
** On Windows,
*** if the path cannot be MBCS-decoded, I think it’s impossible that
they’re passing to the source, but passing it to os.path() functions
would fail;
*** if the path can be MBCS-decoded, `fs2svn` can’t fail.
** On Unix, (almost) arbitrary bytes could exist. If we can’t convert
them to UTF-8, Subversion won’t be able to handle them. In this case, we
should bail out with warning describing the problem.
* For `svn://` and `svn+ssh://` URLs, if they are handled weirdly by
Subversion like HTTP URLs, we should restrict them to ASCII, otherwise
we should do the same as for paths.

>> --- a/tests/test-convert-svn-encoding.t
>> +++ b/tests/test-convert-svn-encoding.t
>> @@ -163,6 +163,26 @@
>>    abort: <a href="http://localhost:$HGPORT/">http://localhost:$HGPORT/\xff: missing or unsupported
>> repository (esc)
>>    [255]
>>
>> +In Subversion, paths are Unicode (encoded as UTF-8). Therefore paths
>> that can't
>> +be converted between UTF-8 and the locale encoding (which is always
>> ASCII in
>> +tests) don't work.
>> +
>> +  $ cp -R svn-repo $XFF
>
> I suspect this would fail on Windows depending on system locale.

It should be encodable by the current code page, but not be ASCII. Is
that even possible in general?
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Yuya Nishihara
In reply to this post by Manuel Jacob
On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:

> >>> -            path = pycompat.fsencode(unicodepath)
> >>> +                return False
> >>> +            path = unicodepath.encode(fsencoding)
> >>
> >> I think pycompat.fsencode() is more correct since the path will be
> >> later
> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> >> encoding.unitolocal() is okay.
> >
> > It was deliberate to use something that fails when it can’t be
> > encoded. However, I thought about failing in case of a logic error,
> > not about that the URL could contain a (UTF-8-encoded) code point that
> > is not available in `fsencoding`.
> >
> > On Windows, Subversion will be able to handle all code points. We’re
> > unnecessarily restricting ourselves by doing the rest of the function
> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> > consider doing the rest of the function in Unicode. We can also bail
> > out, but we have to make sure the warning mentions that it’s not the
> > user’s fault.
> >
> > On Unix, Subversion will try later to convert the path to the locale
> > encoding, which fails if the code point is not encodable. We should
> > bail out in this case.
>
> Let me add that with "bail out", I included that a warning describing
> the problem is shown.
>
> Since what I described is mostly about showing nicer messages, should
> they be added in an amended patch or as follow-ups?

I'm not sure if I understood what you mean.

I just pointed out that the fsencoding is the encoding in which Subversion
will convert unicode to bytes, whereas fsencode() is the function for Python
layer. And the path should be a bytes encoded in Python way.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3 of 6 stable] convert: correctly convert paths to UTF-8 for Subversion

Yuya Nishihara
In reply to this post by Manuel Jacob
On Tue, 30 Jun 2020 15:56:49 +0200, Manuel Jacob wrote:

> On 2020-06-30 14:02, Yuya Nishihara wrote:
> > On Tue, 30 Jun 2020 08:45:44 +0200, Manuel Jacob wrote:
> >> # HG changeset patch
> >> # User Manuel Jacob <[hidden email]>
> >> # Date 1593435816 -7200
> >> #      Mon Jun 29 15:03:36 2020 +0200
> >> # Branch stable
> >> # Node ID 7bd48930ea77337213859f562e2fc0abd6734830
> >> # Parent  4d00ac33053273ed2b9a6431800d59df94adcfc3
> >> # EXP-Topic svn_encoding
> >> convert: correctly convert paths to UTF-8 for Subversion
> >
> >> @@ -117,7 +151,7 @@
> >>              path = b'/' + util.normpath(path)
> >>          # Module URL is later compared with the repository URL
> >> returned
> >>          # by svn API, which is UTF-8.
> >> -        path = encoding.tolocal(path)
> >> +        path = fs2svn(path)
> >>          path = b'file://%s' % quote(path)
> >>      return svn.core.svn_path_canonicalize(path)
> >
> > It's better to document that geturl() may raise UnicodeDecodeError.
>
> I think we should handle all cases where it could fail and bail out with
> a useful warning.
>
> * This is ideally the case for everything handled by issvnurl() (see my
> replies on patch 6).

I know, but geturl() function itself can't handle arbitrary bytes. It's
obvious that fs2svn() may raise unicode exception, but at geturl() level,
it's getting unclear. Since bytes function doesn't generally raise unicode
exception in Mercurial codebase, I think it's better to document about the
design of geturl() function.

> >> --- a/tests/test-convert-svn-encoding.t
> >> +++ b/tests/test-convert-svn-encoding.t
> >> @@ -163,6 +163,26 @@
> >>    abort: <a href="http://localhost:$HGPORT/">http://localhost:$HGPORT/\xff: missing or unsupported
> >> repository (esc)
> >>    [255]
> >>
> >> +In Subversion, paths are Unicode (encoded as UTF-8). Therefore paths
> >> that can't
> >> +be converted between UTF-8 and the locale encoding (which is always
> >> ASCII in
> >> +tests) don't work.
> >> +
> >> +  $ cp -R svn-repo $XFF
> >
> > I suspect this would fail on Windows depending on system locale.
>
> It should be encodable by the current code page, but not be ASCII. Is
> that even possible in general?

I don't know. I can only say that 0xff can't be decoded as *some* variant
of Shift_JISes. I have no idea how Win32API layer will handle that.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Manuel Jacob
In reply to this post by Yuya Nishihara
On 2020-06-30 16:03, Yuya Nishihara wrote:

> On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
>> >>> -            path = pycompat.fsencode(unicodepath)
>> >>> +                return False
>> >>> +            path = unicodepath.encode(fsencoding)
>> >>
>> >> I think pycompat.fsencode() is more correct since the path will be
>> >> later
>> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
>> >> encoding.unitolocal() is okay.
>> >
>> > It was deliberate to use something that fails when it can’t be
>> > encoded. However, I thought about failing in case of a logic error,
>> > not about that the URL could contain a (UTF-8-encoded) code point that
>> > is not available in `fsencoding`.
>> >
>> > On Windows, Subversion will be able to handle all code points. We’re
>> > unnecessarily restricting ourselves by doing the rest of the function
>> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
>> > consider doing the rest of the function in Unicode. We can also bail
>> > out, but we have to make sure the warning mentions that it’s not the
>> > user’s fault.
>> >
>> > On Unix, Subversion will try later to convert the path to the locale
>> > encoding, which fails if the code point is not encodable. We should
>> > bail out in this case.
>>
>> Let me add that with "bail out", I included that a warning describing
>> the problem is shown.
>>
>> Since what I described is mostly about showing nicer messages, should
>> they be added in an amended patch or as follow-ups?
>
> I'm not sure if I understood what you mean.
>
> I just pointed out that the fsencoding is the encoding in which
> Subversion
> will convert unicode to bytes, whereas fsencode() is the function for
> Python
> layer. And the path should be a bytes encoded in Python way.

Subversion will do a similar check as in the rest of the function to
determine whether the repo is actually a SVN repo.

If we use a different encoding, we have a higher chance of running into
cases where our function thinks it’s a SVN repo, while Subversion thinks
it’s not (and vice versa). Therefore, I think we should use
`fsencoding`.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Yuya Nishihara
On Tue, 30 Jun 2020 16:28:55 +0200, Manuel Jacob wrote:

> On 2020-06-30 16:03, Yuya Nishihara wrote:
> > On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
> >> >>> -            path = pycompat.fsencode(unicodepath)
> >> >>> +                return False
> >> >>> +            path = unicodepath.encode(fsencoding)
> >> >>
> >> >> I think pycompat.fsencode() is more correct since the path will be
> >> >> later
> >> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> >> >> encoding.unitolocal() is okay.
> >> >
> >> > It was deliberate to use something that fails when it can’t be
> >> > encoded. However, I thought about failing in case of a logic error,
> >> > not about that the URL could contain a (UTF-8-encoded) code point that
> >> > is not available in `fsencoding`.
> >> >
> >> > On Windows, Subversion will be able to handle all code points. We’re
> >> > unnecessarily restricting ourselves by doing the rest of the function
> >> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> >> > consider doing the rest of the function in Unicode. We can also bail
> >> > out, but we have to make sure the warning mentions that it’s not the
> >> > user’s fault.
> >> >
> >> > On Unix, Subversion will try later to convert the path to the locale
> >> > encoding, which fails if the code point is not encodable. We should
> >> > bail out in this case.
> >>
> >> Let me add that with "bail out", I included that a warning describing
> >> the problem is shown.
> >>
> >> Since what I described is mostly about showing nicer messages, should
> >> they be added in an amended patch or as follow-ups?
> >
> > I'm not sure if I understood what you mean.
> >
> > I just pointed out that the fsencoding is the encoding in which
> > Subversion
> > will convert unicode to bytes, whereas fsencode() is the function for
> > Python
> > layer. And the path should be a bytes encoded in Python way.
>
> Subversion will do a similar check as in the rest of the function to
> determine whether the repo is actually a SVN repo.
>
> If we use a different encoding, we have a higher chance of running into
> cases where our function thinks it’s a SVN repo, while Subversion thinks
> it’s not (and vice versa). Therefore, I think we should use
> `fsencoding`.

Then, can you adjust the documentation of fsencoding? Without that, I would
see there are at least three types of path variables:

 - py_fs_bytes_t (or maybe hg_fs_bytes_t)
 - svn_fs_bytes_t
 - unicode_t (or svn_url_unicode_t)

Be aware that the path will be converted back to UCS2-ish encoding on
Windows by either Python 3 or Win32 ANSI API.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Manuel Jacob
On 2020-06-30 16:48, Yuya Nishihara wrote:

> On Tue, 30 Jun 2020 16:28:55 +0200, Manuel Jacob wrote:
>> On 2020-06-30 16:03, Yuya Nishihara wrote:
>> > On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
>> >> >>> -            path = pycompat.fsencode(unicodepath)
>> >> >>> +                return False
>> >> >>> +            path = unicodepath.encode(fsencoding)
>> >> >>
>> >> >> I think pycompat.fsencode() is more correct since the path will be
>> >> >> later
>> >> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
>> >> >> encoding.unitolocal() is okay.
>> >> >
>> >> > It was deliberate to use something that fails when it can’t be
>> >> > encoded. However, I thought about failing in case of a logic error,
>> >> > not about that the URL could contain a (UTF-8-encoded) code point that
>> >> > is not available in `fsencoding`.
>> >> >
>> >> > On Windows, Subversion will be able to handle all code points. We’re
>> >> > unnecessarily restricting ourselves by doing the rest of the function
>> >> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
>> >> > consider doing the rest of the function in Unicode. We can also bail
>> >> > out, but we have to make sure the warning mentions that it’s not the
>> >> > user’s fault.
>> >> >
>> >> > On Unix, Subversion will try later to convert the path to the locale
>> >> > encoding, which fails if the code point is not encodable. We should
>> >> > bail out in this case.
>> >>
>> >> Let me add that with "bail out", I included that a warning describing
>> >> the problem is shown.
>> >>
>> >> Since what I described is mostly about showing nicer messages, should
>> >> they be added in an amended patch or as follow-ups?
>> >
>> > I'm not sure if I understood what you mean.
>> >
>> > I just pointed out that the fsencoding is the encoding in which
>> > Subversion
>> > will convert unicode to bytes, whereas fsencode() is the function for
>> > Python
>> > layer. And the path should be a bytes encoded in Python way.
>>
>> Subversion will do a similar check as in the rest of the function to
>> determine whether the repo is actually a SVN repo.
>>
>> If we use a different encoding, we have a higher chance of running
>> into
>> cases where our function thinks it’s a SVN repo, while Subversion
>> thinks
>> it’s not (and vice versa). Therefore, I think we should use
>> `fsencoding`.
>
> Then, can you adjust the documentation of fsencoding? Without that, I
> would
> see there are at least three types of path variables:
>
>  - py_fs_bytes_t (or maybe hg_fs_bytes_t)
>  - svn_fs_bytes_t
>  - unicode_t (or svn_url_unicode_t)
>
> Be aware that the path will be converted back to UCS2-ish encoding on
> Windows by either Python 3 or Win32 ANSI API.

The basic distinction is between:

* Mercurial stores paths as bytes.
* The type for Subversion paths and URLs is Unicode. We store them as
UTF-8-encoded bytes (because that is what the Subversion C API accepts),
except for `unicodepath`, which is of Python type `unicode`.
* The OS uses bytes or Unicode for paths. I call that "native string" in
the documentation for `fsencoding`.

In the bottom of issvnurl(), we approximate how Subversion checks the
URL. On Unix, we should therefore convert the path to `fsencoding` (like
Subversion does). On Windows, the right thing would actually be to leave
the path as unicode. For now, I think that we can restrict the URL to
MBCS, given that before, only ASCII worked reliably.

(Part of) that reasoning could be put into a code comment inside
issvnurl().

What piece of information is missing in the documentation for
`fsencoding`?

Obviously the name is not perfect. It’s for the encoding that should be
used for converting hg_fs_bytes_t to Unicode. Plus, On Unix it’s also
the encoding that Subversion actually uses to convert from Unicode to OS
bytes.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion

Yuya Nishihara
On Wed, 01 Jul 2020 06:23:19 +0200, Manuel Jacob wrote:

> On 2020-06-30 16:48, Yuya Nishihara wrote:
> > On Tue, 30 Jun 2020 16:28:55 +0200, Manuel Jacob wrote:
> >> On 2020-06-30 16:03, Yuya Nishihara wrote:
> >> > On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
> >> >> >>> -            path = pycompat.fsencode(unicodepath)
> >> >> >>> +                return False
> >> >> >>> +            path = unicodepath.encode(fsencoding)
> >> >> >>
> >> >> >> I think pycompat.fsencode() is more correct since the path will be
> >> >> >> later
> >> >> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> >> >> >> encoding.unitolocal() is okay.
> >> >> >
> >> >> > It was deliberate to use something that fails when it can’t be
> >> >> > encoded. However, I thought about failing in case of a logic error,
> >> >> > not about that the URL could contain a (UTF-8-encoded) code point that
> >> >> > is not available in `fsencoding`.
> >> >> >
> >> >> > On Windows, Subversion will be able to handle all code points. We’re
> >> >> > unnecessarily restricting ourselves by doing the rest of the function
> >> >> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> >> >> > consider doing the rest of the function in Unicode. We can also bail
> >> >> > out, but we have to make sure the warning mentions that it’s not the
> >> >> > user’s fault.
> >> >> >
> >> >> > On Unix, Subversion will try later to convert the path to the locale
> >> >> > encoding, which fails if the code point is not encodable. We should
> >> >> > bail out in this case.
> >> >>
> >> >> Let me add that with "bail out", I included that a warning describing
> >> >> the problem is shown.
> >> >>
> >> >> Since what I described is mostly about showing nicer messages, should
> >> >> they be added in an amended patch or as follow-ups?
> >> >
> >> > I'm not sure if I understood what you mean.
> >> >
> >> > I just pointed out that the fsencoding is the encoding in which
> >> > Subversion
> >> > will convert unicode to bytes, whereas fsencode() is the function for
> >> > Python
> >> > layer. And the path should be a bytes encoded in Python way.
> >>
> >> Subversion will do a similar check as in the rest of the function to
> >> determine whether the repo is actually a SVN repo.
> >>
> >> If we use a different encoding, we have a higher chance of running
> >> into
> >> cases where our function thinks it’s a SVN repo, while Subversion
> >> thinks
> >> it’s not (and vice versa). Therefore, I think we should use
> >> `fsencoding`.
> >
> > Then, can you adjust the documentation of fsencoding? Without that, I
> > would
> > see there are at least three types of path variables:
> >
> >  - py_fs_bytes_t (or maybe hg_fs_bytes_t)
> >  - svn_fs_bytes_t
> >  - unicode_t (or svn_url_unicode_t)
> >
> > Be aware that the path will be converted back to UCS2-ish encoding on
> > Windows by either Python 3 or Win32 ANSI API.
>
> The basic distinction is between:
>
> * Mercurial stores paths as bytes.
> * The type for Subversion paths and URLs is Unicode. We store them as
> UTF-8-encoded bytes (because that is what the Subversion C API accepts),
> except for `unicodepath`, which is of Python type `unicode`.
> * The OS uses bytes or Unicode for paths. I call that "native string" in
> the documentation for `fsencoding`.
>
> In the bottom of issvnurl(), we approximate how Subversion checks the
> URL. On Unix, we should therefore convert the path to `fsencoding` (like
> Subversion does). On Windows, the right thing would actually be to leave
> the path as unicode. For now, I think that we can restrict the URL to
> MBCS, given that before, only ASCII worked reliably.
>
> (Part of) that reasoning could be put into a code comment inside
> issvnurl().

+1, and adding the inline comment will address my concern.

> What piece of information is missing in the documentation for
> `fsencoding`?

IIUC, it basically says fsencoding should be used when passing bytes to
Subversion where Subversion will convert it back. So passing "fsencod"-ed
bytes to Python os.* functions looks generally wrong (because Python has
its own encoding strategy on Windows.) That was my concern, but in this
issvnurl() case, it works as you've described above.
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel