#36 ✓invalid
Sune Marcher

Notepad_plus.cpp: rewrite of menuID2LangType()

Reported by Sune Marcher | February 3rd, 2010 @ 03:44 AM

Focusing on readability, ease of extension for new language types,
with smaller size as an added bonus - binary size statistics from
VS2008 release mode. Likely faster as well; function isn't really
speed-critical though.

Size statistics:
ORIGINAL: 125 lines, 371 bytes code, 299 bytes jumptable = 670 bytes
REWRITE: 41 lines, 53 bytes code, 54x2x1 table: 108 bytes = 161 bytes

Notes on implementation:
1) assumes, like original, that unknown cmdID maps to L_EXTERNAL.
2) requires that no IDM_LANG_xx is larger than IDM_LANG+254, since we
must encode in bytesize and 255 is reserved for end-table marker. Same goes for L_xx enum mapping, except a value of 255 can be used here. Use of >255 will be caught at compile-time, =255 won't. 3) It would be a good idea to introduce a L_START enum value with same
value as the first L_xx; better than assuming L_TXT will always be the first L_xx enum value. 4) Perhaps match up the IDM_LANG_xx/L_xx for the three special cases?

Comments and changes to this ticket

  • Sune Marcher

    Sune Marcher February 3rd, 2010 @ 06:39 AM

    • State changed from “new” to “needs_ack”
    • Assigned user set to “Sune Marcher”
  • Sune Marcher

    Sune Marcher February 3rd, 2010 @ 07:16 AM

    • Assigned user changed from “Sune Marcher” to “npp-community”

    (from [d0195ab4287c594215dc4fb011621817895b94c0]) Notepad_plus.cpp: rewrite of menuID2LangType()

    • rewrite for easier maintaining, as well as code&binary size.
    • See LH-36 for more information.

    [#36 state:needs_ack responsible:npp-community]

    Signed-off-by: Sune Marcher sm@flork.dk
    http://github.com/snemarch/npp-community/commit/d0195ab4287c594215d...

  • Sune Marcher

    Sune Marcher February 3rd, 2010 @ 07:19 AM

    (from [cd8e23474d39fee53b79c02e34e58de551dedf8d]) Notepad_plus.cpp: rewrite of menuID2LangType()

    • Focusing on readability, ease of extension for new language types, with smaller size as an added bonus - binary size statistics from VS2008 release mode. Likely faster as well; function isn't really speed-critical though.

    • Size statistics: ORIGINAL: 125 lines, 371 bytes code, 299 bytes jumptable = 670 bytes REWRITE: 41 lines, 53 bytes code, 54x2x1 table: 108 bytes = 161 bytes

    [#36]

    Signed-off-by: Sune Marcher sm@flork.dk
    http://github.com/snemarch/npp-community/commit/cd8e23474d39fee53b7...

  • Sune Marcher

    Sune Marcher February 3rd, 2010 @ 12:16 PM

    (from [4317709c9499ee0e6dca6d81f23da9f430b6c917]) Notepad_plus.cpp: rewrite of menuID2LangType()

    • Focusing on readability, ease of extension for new language types, with smaller size as an added bonus - binary size statistics from VS2008 release mode. Likely faster as well; function isn't really speed-critical though.

    • Size statistics:

      • ORIGINAL: 125 lines, 371 bytes code, 299 bytes jumptable = 670 bytes
      • REWRITE: 41 lines, 53 bytes code, 54x2x1 table: 108 bytes = 161 bytes
    • Hopefully fixed git version history that I messed up previously :)

    [#36]

    Signed-off-by: Sune Marcher sm@flork.dk
    http://github.com/snemarch/npp-community/commit/4317709c9499ee0e6dc...

  • Thell Fowler

    Thell Fowler February 3rd, 2010 @ 06:23 PM

    • Load time using this patch takes 40% longer. Originally tested opening with 10 randomly chosen example files the average jumped from 5.4 to 8.4 seconds averaged over three runs. This seemed quite extreme so I checked using 10 iterations using a debug build with a session file that opens extremely small hello world example files for most of the language types recognized by notepad++. 40 in all iirc. This was then measured using powershell.exe, cbd.exe, and notepad++Debug.exe with the following command::

    PS C:\Program Files\Debugging Tools for Windows (x86)> 1..10|measure-command {$cdbRun = .\cdb.exe -kqm -g -G -y 'c:\users\almostautomated\datastore\repos\git\npp-community-setup\PowerEditor\bin\' 'c:\users\almostautomated\datastore\repos\git\npp-community-setup\PowerEditor\bin\notepad++Debug.exe' -leakdetect}

    • Personally I don't like horizontal lines of values when vertical will do as it requires the whole block needing to be changed in order to add one entry.
  • Jocelyn Legault

    Jocelyn Legault February 3rd, 2010 @ 08:36 PM

    • Code size and line numbers isn't a metric we really do care about. That does not mean we encourage code bloat; However readability, stability, reusability and maintainability (in no particular order) will always prime over code size.
    • Grouping your values by 3's doesn't improve readability, but rather deteriorates it IMO.
    • While we don't have a written set of coding standards, we still expect coders to use braces for control flows. Missing braces when (potentially) adding lines to code blocks in the future is a risk that we do not want to take. As for whether the braces should be opened on the same line as the conditional or looping statement, it should match the surrounding code. Closing braces are always on their own lines.
    • We're a C++ project and it's my belief that we should use C++ constructs whenever possible. To that effect, if you want to represent a map, then use an std::map instead of one off construct
    • We never use magic numbers (i.e. 0xFF). Instead, use a #define or a const to represent your value (MAP_END, for example).
  • Sune Marcher

    Sune Marcher February 4th, 2010 @ 10:02 AM

    • State changed from “needs_ack” to “open”
    • Assigned user changed from “npp-community” to “Sune Marcher”

    First: thanks for the feedback, guys.

    Jocelyn

    Grouping your values by 3's doesn't improve readability, but rather deteriorates it IMO.

    Well, the entries don't need to be in order, so you can just add to the bottom - personally I feel the multiple values per line gives a better overview, but this is of course a matter of taste. I only tend to use multiple things per line for things like listing data values - not "normal" code.

    Code size and line numbers isn't a metric we really do care about. That does not mean we encourage code bloat; However readability, stability, reusability and maintainability (in no particular order) will always prime over code size.

    IMHO line numbers can mean a lot wrt. readability, reusability and maintainability. Smaller doesn't necessarily mean better (I prefer several lines of code over obscure one-liners, and decent comments altso take up lines). OTOH there's several examples in the NP++ codebase with large amounts of code duplication - generalizing code results in less codelines and easier maintenance.

    I agree on (binary) code size not being an important factor, I just got a bit carried away :). Btw I don't feel that menuID2LangType() is the most important thing to focus on, but I deliberately wanted to start with something simple & limited in scope, and get used to how things work here.

    we still expect coders to use braces for control flows. Missing braces when (potentially) adding lines to code blocks in the future is a risk that we do not want to take

    I generally agree with this, and often use braces for one-line if/else in my own code. I do, however, feel it's OK to omit them in short and clear functions - but I don't mind following standards and always adding them. While "wasting" code lines, I prefer open-braces on their own lines, so they match up with close-braces (which I find valuable even with editors that have brace-match highlight).

    We're a C++ project and it's my belief that we should use C++ constructs whenever possible. To that effect, if you want to represent a map, then use an std::map instead of one off construct

    That's something I also generally agree with - std::map is great for a lot of stuff. But it requires setting up that can't be done at compile-time (leads to singleton pattern or manually calling a setup/teardown), and for highly continuous/packed values like this, it's not really optimal (std::map implementation performance is almost identical to the code I submitted: i.e., not optimal). std::map is really great for more dynamic values though, or values-keys with more interspersed distribution.

    We never use magic numbers (i.e. 0xFF). Instead, use a #define or a const to represent your value (MAP_END, for example).

    Agree 100%.

    Thell

    Load time using this patch takes 40% longer. Originally tested opening with 10 randomly chosen example files the average jumped from 5.4 to 8.4 seconds averaged over three runs.

    That's interesting - I didn't really expect this function to be a critical code path. I ended up writing a little test suite (correctness + speed), and it does indeed turn out that my routine is ~10x as slow as the original (no wonder, really, since the switch cases are so tightly packed that VC2008 generates a jumptable rather than the typical binary-search conditional jumps).

    However, on my 2GHz laptop, running 5.000.000 iterations of 17 lookups, the original takes ~450ms and my new version ~4200ms. Considering I need ~5mil iterations x 17 lookups to get ~4s speed difference, your slowdown sounds a bit extreme. Either you have a debug build (with some wildly suboptimal code generation O_o), or there's something quirky wrt. how often np++ calls the function?

    I did another test implementation that simply does a direct table lookup, but this requires runtime setup (or build-step generated table - you don't want to maintain such a table by hand). It's only ~100ms faster than switch/case for those 5mil*17 lookups, so not worth it. The optimal solution would be having IDM_LANG_xx implemented in terms of L_XX, which would result in very simple & efficient code. This only saves ~100ms over the direct-table-lookup, though, and needs some code restructure - I'll look into whether this is feasible.

    Anyway, this is turning into somewhat of a micro-optimization issue, which isn't what I'm intending to contribute to the project (even if I do like that kind of stuff). I still feel that menuID2LangType(), along with NppParameters::langTypeToCommandID(), can be rewritten cleaner - I'll pursue this for a bit, since I believe it to be one of the simpler cleanups to do, and a good lesson in contributing to np++cr.

    I'll scrap the current code implementation, but will continue this ticket. Sounds OK to you guys?

  • Jocelyn Legault

    Jocelyn Legault February 4th, 2010 @ 05:26 PM

    OTOH there's several examples in the NP++ codebase with large amounts of code duplication - generalizing code results in less codelines and easier maintenance.

    That is very true, and the problems with duplication need to be fixed. However, this is not such a problem.

    it's not really optimal (std::map implementation performance is almost identical to the code I submitted: i.e., not optimal).

    std::map implementation uses a tree, and is therefore a O(log(n)) search. Yours is O(n), which is worst. Funnily enough, the current implementation with the switch statement is O(1), which is the best possible. Take a look at the disassembly to convince yourself:

    Disassembly

    I would therefore not touch that code, even to replace it with a direct lookup table. You'd end up with an array filled with the L_ values and you'd return langArray[cmdID - IDM_LANG_C]; or something similar. That would not help visibility either... So, all in all, and as weird as it sounds (even to me), I think the implementation that is currently there is probably the best possible in the circumstances.

    That being said, Notepad_plus.cpp is a scary big file filled with tons of unrelated things that were bunched up together for no obvious reasons... Maybe yanking that function out of there and into a more sensible object could make sense...

    Finally, I'd like to point to the Zen of Python that I try to use as much as possible as a yardstick to determine how to code stuff... Maybe you'll understand better where I'm coming from with my comments with that.

  • Thell Fowler

    Thell Fowler February 4th, 2010 @ 11:20 PM

    Sune,

    However, on my 2GHz laptop, running 5.000.000 iterations of 17 lookups, the original takes ~450ms and my new version ~4200ms. Considering I need ~5mil iterations x 17 lookups to get ~4s speed difference, your slowdown sounds a bit extreme. Either you have a debug build (with some wildly suboptimal code generation O_o), or there's something quirky wrt. how often np++ calls the function?

    Of course I mentioned exactly how I did my testing and what build I used. I run a debug build as do lots and lots of developers and plugin writers. And since cdb just starts and loads N++, and the -leakdetect flag just forces an exit after init, I feel this test represents a fairly good example.

  • Sune Marcher

    Sune Marcher February 5th, 2010 @ 12:37 AM

    std::map implementation uses a tree, and is therefore a O(log(n)) search. Yours is O(n), which is worst. Funnily enough, the current implementation with the switch statement is O(1), which is the best possible. Take a look at the disassembly to convince yourself.

    I know about the implementation of std::map, and about big-O - and I know enough about it to know that it isn't everything :). There's so much overhead in std::map that, for the small amount of values, is only ~5% faster than the O(n) search. I could also give examples of bubblesort beating quick- and heapsort :)

    Funnily enough, the current implementation with the switch statement is O(1), which is the best possible.

    Yep, already mentioned this above. It's too bad it's done with a jumptable, it would be nice if the compiler was smart enough to turn it into a LUT instead. Anyway, the point wasn't really speed but what I belived to be neater code (and then I got obsessed with the size optimization).

    I would therefore not touch that code, even to replace it with a direct lookup table. You'd end up with an array filled with the L_ values and you'd return langArray[cmdID - IDM_LANG_C]; or something similar. That would not help visibility either...

    Yep, already mentioned it's not worth it - it does give a fair reduction in binary code size, but the speed advantage is hardly measurable.

    So, all in all, and as weird as it sounds (even to me), I think the implementation that is currently there is probably the best possible in the circumstances.

    IMHO the best would be, as mentioned above, would be implementing the IDM_LANG_xx in terms of L_xx - but I have to do a bit of code digging to see if this is a feasible solution. Until then, the switch solution is indeed the best. I think it'd still benefit from MAP_xx style macros (even without multiple cases on one line :)), but I won't spend time on that; instead I'll see if the "IDM_LANG_xx in terms of L_x" is a realistic thing to do - if it isn't, I'll close this ticket.

    That being said, Notepad_plus.cpp is a scary big file filled with tons of unrelated things that were bunched up together for no obvious reasons... Maybe yanking that function out of there and into a more sensible object could make sense...

    Indeed. It's a big monolithic file that should be cleaned up and broken into smaller pieces. I know this is probably not the most important task, but imho it's worth doing (will ease maintenance in the long run), and a good way to get familiar with the codebase. I also expect that other people are more interested in other aspects, so I can probably do some janitoring without duplicating other people's effort?

    Finally, I'd like to point to the Zen of Python that I try to use as much as possible as a yardstick to determine how to code stuff... Maybe you'll understand better where I'm coming from with my comments with that.

    It's a good little piece, came across it years ago, and I pretty much agree with it - also this part: Although practicality beats purity. :) - also note that I didn't consider the rewrite final and ready for consumption, but wanted some feedback. You've given that, and it has been constructive - thanks!

    Of course I mentioned exactly how I did my testing and what build I used. I run a debug build as do lots and lots of developers and plugin writers. And since cdb just starts and loads N++, and the -leakdetect flag just forces an exit after init, I feel this test represents a fairly good example.

    Sorry if I came off as "you must be doing something weird", that wasn't the intention - I'm just puzzled at how this function can cause a speed hit, and I'm going to investigate further :)

  • Sune Marcher

    Sune Marcher February 5th, 2010 @ 01:43 AM

    Quick note: tested loading those 41 language files, only see a ~300ms time diff on the calls to menuID2LangType() calls, with debug version - no wonder I didn't feel a performance hit. Note to self: test on slower hardware :)

    Confirmed that you only get the expected number of calls - one per loaded file; good.

  • Thell Fowler

    Thell Fowler February 20th, 2010 @ 05:40 PM

    Sune,

    I'm wondering if you did the same test I did. If not your ~300ms * 10 iterations would be just about the 3 sec diff I noted. ;)

    Anyhow. So are you dropping this ticket, or coming up with a different method, or what?

  • Sune Marcher

    Sune Marcher February 21st, 2010 @ 02:49 PM

    I did the test slightly differently - OutputDebugString on entry/exit of the function. Started NP++ with a session including those language files, and then subtracted first entry-tickcount from the last exit-tickcount. This obviously measures slightly more code than just the function, but since I subtracted the timings from the old version, it's good enough for measuring overhead. The 300ms is the slowdown for loading all the 41 test files, running debug builds on my Q6600 workstation.

    Anyway, I've been digging deeper into the NP++ source code, and the Proper Thing To Do is a larger rewrite of various language-related functions. So I don't know if this ticket should be dropped in favor of a new ticket with a better description, considering it's a larger task than just rewriting menuID2LangType?

    I have to reinstall my workstation before proceeding with the code, though - I'm currently running Win7 RC, and it's complaining that it'll deactivate in 8 days or so :)

  • Jocelyn Legault

    Jocelyn Legault February 22nd, 2010 @ 10:00 PM

    You probably just drop this ticket and open new ones for the other tasks.
    Please keep the tasks at a reasonable scope. "Rewrite language-related
    functions" is probably too large and too vague.

    Which reminds me that I should fix some of my older broad spectrum tickets
    to something more specific (i.e. #17: Fix Parameters.cpp).

    cheers,

    joce.

  • Sune Marcher

    Sune Marcher February 22nd, 2010 @ 11:52 PM

    Dropping this ticket is probably the best idea, yes.

    And yes, "Rewrite language-related functions" is too vague - what I'm particularly aiming at is L_xx and IDM_LANG_xx (ie., not internationalization or much to do with user-defined languages). As it is now, if you change language stuff, you need to edit menuCmdID.h, Notepad_plus.rc, Notepad_plus.cpp, Parameters.cpp (did I forget any? :)). It's a bit hard reducing the scope of the edits, but it can be done as a series of commits if needed. I'm afraid that a fair amount of the refactoring NP++ needs will have similar scope, because of the spaghetti nature of the source.

    I hope I'll have time to try my hand at this during the weekend, but I need to dig a bit deeper to make sure what I have in mind can be implemented without breaking anything. I'll try to catch you guys on IRC before I start hacking away.

  • Thell Fowler

    Thell Fowler February 26th, 2010 @ 05:33 PM

    • State changed from “open” to “invalid”

    Okie dokie then, We'll mark this ticket as invalid.

    BTW - take a look at npp_debug and FuncGuards. It'll save you time of playing around with OutputDebugString messages.

    Also, imo, if you're going to test the performance impact of a change isn't it better to test it from the user's perspective and include a set number of iterations? That's why I chose to:

    • do 10 iterations.
    • use a performance counter that anyone with Windows XP SP2 and greater has access to.
    • measure the full startup-shutdown sequence both pre and post patch implementation.

    Just my 2 cents, but it seems more realistic that way.

    I liked Don's idea of a timer, but the implementation leaves me wanting...

    Thell

Please Sign in or create a free account to add a new ticket.

With your very own profile, you can contribute to projects, track your activity, watch tickets, receive and update tickets through your email and much more.

New-ticket Create new ticket

Create your profile

Help contribute to this project by taking a few moments to create your personal profile. Create your profile ยป

Notepad++ Community Release

Shared Ticket Bins

Attachments