D8850: dirstate: restore original estimation and update comment

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

D8850: dirstate: restore original estimation and update comment

pulkit (Pulkit Goyal)
joerg.sonnenberger created this revision.
Herald added a reviewer: hg-reviewers.
Herald added a subscriber: mercurial-patches.

REVISION SUMMARY
  The former comment didn't reflect the content of the dirstate entries,
  the two nodes are a fixed header in the file and not per-entry. Split
  the documented entry into the path part and the fixed header. The
  heuristic itself is still valid, e.g. for the NetBSD src tree a maximum
  path size of 142 and an average of 49, resulting in 66 bytes per entry
  on average.

REPOSITORY
  rHG Mercurial

BRANCH
  default

REVISION DETAIL
  https://phab.mercurial-scm.org/D8850

AFFECTED FILES
  mercurial/dirstate.py

CHANGE DETAILS

diff --git a/mercurial/dirstate.py b/mercurial/dirstate.py
--- a/mercurial/dirstate.py
+++ b/mercurial/dirstate.py
@@ -1658,18 +1658,12 @@
 
         if util.safehasattr(parsers, b'dict_new_presized'):
             # Make an estimate of the number of files in the dirstate based on
-            # its size. From a linear regression on a set of real-world repos,
-            # all over 10,000 files, the size of a dirstate entry is 2 nodes
-            # plus 45 bytes. The cost of resizing is significantly higher than the cost
-            # of filling in a larger presized dict, so subtract 20% from the
-            # size.
-            #
-            # This heuristic is imperfect in many ways, so in a future dirstate
-            # format update it makes sense to just record the number of entries
-            # on write.
-            self._map = parsers.dict_new_presized(
-                len(st) // ((2 * self._nodelen + 45) * 4 // 5)
-            )
+            # its size. This trades wasting some memory for avoiding costly
+            # resizes. Each entry have a prefix of 17 bytes followed by one or
+            # two path names. Studies on various large-scale real-world repositories
+            # found 54 bytes a reasonable upper limit for the average path names.
+            # Copy entries are ignored for the sake of this estimate.
+            self._map = parsers.dict_new_presized(len(st) // 71)
 
         # Python's garbage collector triggers a GC each time a certain number
         # of container objects (the number being defined by



To: joerg.sonnenberger, #hg-reviewers
Cc: mercurial-patches, mercurial-devel
_______________________________________________
Mercurial-devel mailing list
[hidden email]
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel