[PATCH] pycompat: use os.fsencode() to re-encode sys.argv

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] pycompat: use os.fsencode() to re-encode sys.argv

Manuel Jacob
# HG changeset patch
# User Manuel Jacob <[hidden email]>
# Date 1593002661 -7200
#      Wed Jun 24 14:44:21 2020 +0200
# Node ID 0bc2e245a6f89bd5797e234874733f31778aea67
# Parent  c2df0bca0dfa30cdf177d37b852841002d670ee5
# EXP-Topic use_os_fsencode
pycompat: use os.fsencode() to re-encode sys.argv

Historically, the previous code made sense, as Py_EncodeLocale() and
fs.fsencode() could possibly use different encodings. However, this is not the
case anymore for Python 3.2, which uses the locale encoding as the filesystem
encoding (this is not true for later Python versions, but see below). See
https://vstinner.github.io/painful-history-python-filesystem-encoding.html for
a source and more background information.

Using os.fsencode() is safer, as the documentation for sys.argv says that it can
be used to get the original bytes. When doing further changes, the Python
developers will take care that this continues to work.

One concrete case where os.fsencode() is more correct is when enabling Python's
UTF-8 mode. Py_DecodeLocale() will use UTF-8 in this case. Our previous code
would have encoded it using the locale encoding (which might be different),
whereas os.fsencode() will encode it with UTF-8.

Since we don’t claim to support the UTF-8 mode, this is not really a bug and the
patch can go to the default branch. It might be a good idea to not commit this
to the stable branch, as it could in theory introduce regressions.

diff --git a/mercurial/pycompat.py b/mercurial/pycompat.py
--- a/mercurial/pycompat.py
+++ b/mercurial/pycompat.py
@@ -156,16 +156,10 @@
 
     if getattr(sys, 'argv', None) is not None:
         # On POSIX, the char** argv array is converted to Python str using
-        # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which isn't
-        # directly callable from Python code. So, we need to emulate it.
-        # Py_DecodeLocale() calls mbstowcs() and falls back to mbrtowc() with
-        # surrogateescape error handling on failure. These functions take the
-        # current system locale into account. So, the inverse operation is to
-        # .encode() using the system locale's encoding and using the
-        # surrogateescape error handler. The only tricky part here is getting
-        # the system encoding correct, since `locale.getlocale()` can return
-        # None. We fall back to the filesystem encoding if lookups via `locale`
-        # fail, as this seems like a reasonable thing to do.
+        # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which
+        # isn't directly callable from Python code. In practice, os.fsencode()
+        # can be used instead (this is recommended by Python's documentation
+        # for sys.argv).
         #
         # On Windows, the wchar_t **argv is passed into the interpreter as-is.
         # Like POSIX, we need to emulate what Py_EncodeLocale() would do. But
@@ -178,12 +172,7 @@
         if os.name == r'nt':
             sysargv = [a.encode("mbcs", "ignore") for a in sys.argv]
         else:
-            encoding = (
-                locale.getlocale()[1]
-                or locale.getdefaultlocale()[1]
-                or sys.getfilesystemencoding()
-            )
-            sysargv = [a.encode(encoding, "surrogateescape") for a in sys.argv]
+            sysargv = [fsencode(a) for a in sys.argv]
 
     bytechr = struct.Struct('>B').pack
     byterepr = b'%r'.__mod__
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] pycompat: use os.fsencode() to re-encode sys.argv

Manuel Jacob
This breaks tests
(https://foss.heptapod.net/octobus/mercurial-devel/pipelines/7240). I’ll
investigate.

On 2020-06-24 14:57, Manuel Jacob wrote:

> # HG changeset patch
> # User Manuel Jacob <[hidden email]>
> # Date 1593002661 -7200
> #      Wed Jun 24 14:44:21 2020 +0200
> # Node ID 0bc2e245a6f89bd5797e234874733f31778aea67
> # Parent  c2df0bca0dfa30cdf177d37b852841002d670ee5
> # EXP-Topic use_os_fsencode
> pycompat: use os.fsencode() to re-encode sys.argv
>
> Historically, the previous code made sense, as Py_EncodeLocale() and
> fs.fsencode() could possibly use different encodings. However, this is
> not the
> case anymore for Python 3.2, which uses the locale encoding as the
> filesystem
> encoding (this is not true for later Python versions, but see below).
> See
> https://vstinner.github.io/painful-history-python-filesystem-encoding.html 
> for
> a source and more background information.
>
> Using os.fsencode() is safer, as the documentation for sys.argv says
> that it can
> be used to get the original bytes. When doing further changes, the
> Python
> developers will take care that this continues to work.
>
> One concrete case where os.fsencode() is more correct is when enabling
> Python's
> UTF-8 mode. Py_DecodeLocale() will use UTF-8 in this case. Our previous
> code
> would have encoded it using the locale encoding (which might be
> different),
> whereas os.fsencode() will encode it with UTF-8.
>
> Since we don’t claim to support the UTF-8 mode, this is not really a
> bug and the
> patch can go to the default branch. It might be a good idea to not
> commit this
> to the stable branch, as it could in theory introduce regressions.
>
> diff --git a/mercurial/pycompat.py b/mercurial/pycompat.py
> --- a/mercurial/pycompat.py
> +++ b/mercurial/pycompat.py
> @@ -156,16 +156,10 @@
>
>      if getattr(sys, 'argv', None) is not None:
>          # On POSIX, the char** argv array is converted to Python str
> using
> -        # Py_DecodeLocale(). The inverse of this is
> Py_EncodeLocale(), which isn't
> -        # directly callable from Python code. So, we need to emulate
> it.
> -        # Py_DecodeLocale() calls mbstowcs() and falls back to
> mbrtowc() with
> -        # surrogateescape error handling on failure. These functions
> take the
> -        # current system locale into account. So, the inverse
> operation is to
> -        # .encode() using the system locale's encoding and using the
> -        # surrogateescape error handler. The only tricky part here is
> getting
> -        # the system encoding correct, since `locale.getlocale()` can
> return
> -        # None. We fall back to the filesystem encoding if lookups via
> `locale`
> -        # fail, as this seems like a reasonable thing to do.
> +        # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(),
> which
> +        # isn't directly callable from Python code. In practice,
> os.fsencode()
> +        # can be used instead (this is recommended by Python's
> documentation
> +        # for sys.argv).
>          #
>          # On Windows, the wchar_t **argv is passed into the
> interpreter as-is.
>          # Like POSIX, we need to emulate what Py_EncodeLocale() would
> do. But
> @@ -178,12 +172,7 @@
>          if os.name == r'nt':
>              sysargv = [a.encode("mbcs", "ignore") for a in sys.argv]
>          else:
> -            encoding = (
> -                locale.getlocale()[1]
> -                or locale.getdefaultlocale()[1]
> -                or sys.getfilesystemencoding()
> -            )
> -            sysargv = [a.encode(encoding, "surrogateescape") for a in
> sys.argv]
> +            sysargv = [fsencode(a) for a in sys.argv]
>
>      bytechr = struct.Struct('>B').pack
>      byterepr = b'%r'.__mod__
> _______________________________________________
> Mercurial-devel mailing list
> [hidden email]
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] pycompat: use os.fsencode() to re-encode sys.argv

Manuel Jacob
It turns out that the test breakage is not at all because of the changes
in the patch. It’s because of non-ASCII characters in the changeset
description. See
https://foss.heptapod.net/octobus/mercurial-devel/-/commit/e5930264da81a3a73b424efc46f114dbfb9bbe6d,
which also fails.

GitLab puts the changeset description in the environment variables
(maybe after today’s update?), which hgweb tries to encode using
latin-1. The solution is... more fsdecode()s. I hope this doesn’t sound
too dubious to be true. :)

On 2020-06-24 16:14, Manuel Jacob wrote:

> This breaks tests
> (https://foss.heptapod.net/octobus/mercurial-devel/pipelines/7240).
> I’ll investigate.
>
> On 2020-06-24 14:57, Manuel Jacob wrote:
>> # HG changeset patch
>> # User Manuel Jacob <[hidden email]>
>> # Date 1593002661 -7200
>> #      Wed Jun 24 14:44:21 2020 +0200
>> # Node ID 0bc2e245a6f89bd5797e234874733f31778aea67
>> # Parent  c2df0bca0dfa30cdf177d37b852841002d670ee5
>> # EXP-Topic use_os_fsencode
>> pycompat: use os.fsencode() to re-encode sys.argv
>>
>> Historically, the previous code made sense, as Py_EncodeLocale() and
>> fs.fsencode() could possibly use different encodings. However, this is
>> not the
>> case anymore for Python 3.2, which uses the locale encoding as the
>> filesystem
>> encoding (this is not true for later Python versions, but see below).
>> See
>> https://vstinner.github.io/painful-history-python-filesystem-encoding.html 
>> for
>> a source and more background information.
>>
>> Using os.fsencode() is safer, as the documentation for sys.argv says
>> that it can
>> be used to get the original bytes. When doing further changes, the
>> Python
>> developers will take care that this continues to work.
>>
>> One concrete case where os.fsencode() is more correct is when enabling
>> Python's
>> UTF-8 mode. Py_DecodeLocale() will use UTF-8 in this case. Our
>> previous code
>> would have encoded it using the locale encoding (which might be
>> different),
>> whereas os.fsencode() will encode it with UTF-8.
>>
>> Since we don’t claim to support the UTF-8 mode, this is not really a
>> bug and the
>> patch can go to the default branch. It might be a good idea to not
>> commit this
>> to the stable branch, as it could in theory introduce regressions.
>>
>> diff --git a/mercurial/pycompat.py b/mercurial/pycompat.py
>> --- a/mercurial/pycompat.py
>> +++ b/mercurial/pycompat.py
>> @@ -156,16 +156,10 @@
>>
>>      if getattr(sys, 'argv', None) is not None:
>>          # On POSIX, the char** argv array is converted to Python str
>> using
>> -        # Py_DecodeLocale(). The inverse of this is
>> Py_EncodeLocale(), which isn't
>> -        # directly callable from Python code. So, we need to emulate
>> it.
>> -        # Py_DecodeLocale() calls mbstowcs() and falls back to
>> mbrtowc() with
>> -        # surrogateescape error handling on failure. These functions
>> take the
>> -        # current system locale into account. So, the inverse
>> operation is to
>> -        # .encode() using the system locale's encoding and using the
>> -        # surrogateescape error handler. The only tricky part here is
>> getting
>> -        # the system encoding correct, since `locale.getlocale()` can
>> return
>> -        # None. We fall back to the filesystem encoding if lookups
>> via `locale`
>> -        # fail, as this seems like a reasonable thing to do.
>> +        # Py_DecodeLocale(). The inverse of this is
>> Py_EncodeLocale(), which
>> +        # isn't directly callable from Python code. In practice,
>> os.fsencode()
>> +        # can be used instead (this is recommended by Python's
>> documentation
>> +        # for sys.argv).
>>          #
>>          # On Windows, the wchar_t **argv is passed into the
>> interpreter as-is.
>>          # Like POSIX, we need to emulate what Py_EncodeLocale() would
>> do. But
>> @@ -178,12 +172,7 @@
>>          if os.name == r'nt':
>>              sysargv = [a.encode("mbcs", "ignore") for a in sys.argv]
>>          else:
>> -            encoding = (
>> -                locale.getlocale()[1]
>> -                or locale.getdefaultlocale()[1]
>> -                or sys.getfilesystemencoding()
>> -            )
>> -            sysargv = [a.encode(encoding, "surrogateescape") for a in
>> sys.argv]
>> +            sysargv = [fsencode(a) for a in sys.argv]
>>
>>      bytechr = struct.Struct('>B').pack
>>      byterepr = b'%r'.__mod__
>> _______________________________________________
>> Mercurial-devel mailing list
>> [hidden email]
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
> _______________________________________________
> Mercurial-devel mailing list
> [hidden email]
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] pycompat: use os.fsencode() to re-encode sys.argv

Manuel Jacob
Rebased on the WSGI environment patches it passes test-py3 heptapod CI:
https://foss.heptapod.net/octobus/mercurial-devel/-/jobs/81591

When run directly, it passed before and after the other patch.

I’ve sent an otherwise unmodified version of this patch rebased on a
merge from stable.

On 2020-06-25 02:05, Manuel Jacob wrote:

> It turns out that the test breakage is not at all because of the
> changes in the patch. It’s because of non-ASCII characters in the
> changeset description. See
> https://foss.heptapod.net/octobus/mercurial-devel/-/commit/e5930264da81a3a73b424efc46f114dbfb9bbe6d,
> which also fails.
>
> GitLab puts the changeset description in the environment variables
> (maybe after today’s update?), which hgweb tries to encode using
> latin-1. The solution is... more fsdecode()s. I hope this doesn’t
> sound too dubious to be true. :)
>
> On 2020-06-24 16:14, Manuel Jacob wrote:
>> This breaks tests
>> (https://foss.heptapod.net/octobus/mercurial-devel/pipelines/7240).
>> I’ll investigate.
>>
>> On 2020-06-24 14:57, Manuel Jacob wrote:
>>> # HG changeset patch
>>> # User Manuel Jacob <[hidden email]>
>>> # Date 1593002661 -7200
>>> #      Wed Jun 24 14:44:21 2020 +0200
>>> # Node ID 0bc2e245a6f89bd5797e234874733f31778aea67
>>> # Parent  c2df0bca0dfa30cdf177d37b852841002d670ee5
>>> # EXP-Topic use_os_fsencode
>>> pycompat: use os.fsencode() to re-encode sys.argv
>>>
>>> Historically, the previous code made sense, as Py_EncodeLocale() and
>>> fs.fsencode() could possibly use different encodings. However, this
>>> is not the
>>> case anymore for Python 3.2, which uses the locale encoding as the
>>> filesystem
>>> encoding (this is not true for later Python versions, but see below).
>>> See
>>> https://vstinner.github.io/painful-history-python-filesystem-encoding.html 
>>> for
>>> a source and more background information.
>>>
>>> Using os.fsencode() is safer, as the documentation for sys.argv says
>>> that it can
>>> be used to get the original bytes. When doing further changes, the
>>> Python
>>> developers will take care that this continues to work.
>>>
>>> One concrete case where os.fsencode() is more correct is when
>>> enabling Python's
>>> UTF-8 mode. Py_DecodeLocale() will use UTF-8 in this case. Our
>>> previous code
>>> would have encoded it using the locale encoding (which might be
>>> different),
>>> whereas os.fsencode() will encode it with UTF-8.
>>>
>>> Since we don’t claim to support the UTF-8 mode, this is not really a
>>> bug and the
>>> patch can go to the default branch. It might be a good idea to not
>>> commit this
>>> to the stable branch, as it could in theory introduce regressions.
>>>
>>> diff --git a/mercurial/pycompat.py b/mercurial/pycompat.py
>>> --- a/mercurial/pycompat.py
>>> +++ b/mercurial/pycompat.py
>>> @@ -156,16 +156,10 @@
>>>
>>>      if getattr(sys, 'argv', None) is not None:
>>>          # On POSIX, the char** argv array is converted to Python str
>>> using
>>> -        # Py_DecodeLocale(). The inverse of this is
>>> Py_EncodeLocale(), which isn't
>>> -        # directly callable from Python code. So, we need to emulate
>>> it.
>>> -        # Py_DecodeLocale() calls mbstowcs() and falls back to
>>> mbrtowc() with
>>> -        # surrogateescape error handling on failure. These functions
>>> take the
>>> -        # current system locale into account. So, the inverse
>>> operation is to
>>> -        # .encode() using the system locale's encoding and using the
>>> -        # surrogateescape error handler. The only tricky part here
>>> is getting
>>> -        # the system encoding correct, since `locale.getlocale()`
>>> can return
>>> -        # None. We fall back to the filesystem encoding if lookups
>>> via `locale`
>>> -        # fail, as this seems like a reasonable thing to do.
>>> +        # Py_DecodeLocale(). The inverse of this is
>>> Py_EncodeLocale(), which
>>> +        # isn't directly callable from Python code. In practice,
>>> os.fsencode()
>>> +        # can be used instead (this is recommended by Python's
>>> documentation
>>> +        # for sys.argv).
>>>          #
>>>          # On Windows, the wchar_t **argv is passed into the
>>> interpreter as-is.
>>>          # Like POSIX, we need to emulate what Py_EncodeLocale()
>>> would do. But
>>> @@ -178,12 +172,7 @@
>>>          if os.name == r'nt':
>>>              sysargv = [a.encode("mbcs", "ignore") for a in sys.argv]
>>>          else:
>>> -            encoding = (
>>> -                locale.getlocale()[1]
>>> -                or locale.getdefaultlocale()[1]
>>> -                or sys.getfilesystemencoding()
>>> -            )
>>> -            sysargv = [a.encode(encoding, "surrogateescape") for a
>>> in sys.argv]
>>> +            sysargv = [fsencode(a) for a in sys.argv]
>>>
>>>      bytechr = struct.Struct('>B').pack
>>>      byterepr = b'%r'.__mod__
>>> _______________________________________________
>>> Mercurial-devel mailing list
>>> [hidden email]
>>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>> _______________________________________________
>> Mercurial-devel mailing list
>> [hidden email]
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
> _______________________________________________
> Mercurial-devel mailing list
> [hidden email]
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel