Is there a way to configure system-wide word delimiters for text in GNOME/GTK+?

I'm looking for a way to define or view and modify which characters are considered word boundaries by GNOME, similar to the "select-by-word" characters used in GNOME Terminal (related question) or cutchars in rxvt.

For example, when I am using gedit I can type abc123xyz and double-click, but only three characters will be selected, since the numbers are treated as a boundary by default. I see the same behavior in Tomboy, and in at least a few other GNOME/GTK+ applications I've tested. It's also present--and particularly annoying--inside the location bar (but not inside pages) of Google Chrome. Firefox seems OK, I believe because its UI is written in XUL, which does its own thing.

Please note that I am looking for a general solution, if one exists--I already know that there are plugins for gedit in particular, and maybe other per-application tweaks, but I want to know if this can be done in a way that affects most or all programs that inherit this behavior from the window manager (I presume?).

Noyo

Posted 2013-09-25T17:54:13.183

Reputation: 139

Please also let me know if you can't reproduce this behavior. If that's the case, I suspect it may have something to do with locales. I'm on Debian wheezy, using GNOME 3, en_US.UTF-8 locale (sometimes fr_FR.UTF-8). – Noyo – 2013-09-26T09:35:57.270

More research hints that it's maybe not locale-related, but rather a well-established mystery related to the way (all?) Gtk+ applications seem to behave by default: https://mail.gnome.org/archives/gtk-list/2011-June/msg00060.html and https://mail.gnome.org/archives/gtk-i18n-list/2011-June/msg00003.html

– Noyo – 2013-09-26T13:31:09.480

Also: http://forums.opensuse.org/english/other-forums/looking-something-other-than-support/461854-anyone-know-how-change-pangos-word-separators.html

– Noyo – 2013-09-26T13:37:27.533

Answers

Apparently, this was fixed in Pango 1.34.1; check the git log for the commits between the 1.34.0 and 1.34.1 tags. Ubuntu 13.04 ships version 1.32.5 of libpango1.0-0, the same as Debian sid, so no joy for now on Debian-based distributions. If you need the fix, you can compile the library from source.

This is an issue with the PangoLogAttr() function, and it seems nobody has filed a bug report about it. The issue dates back to 2003 and persists today. Apart from the mails you linked, there are:

https://mail.gnome.org/archives/gtk-i18n-list/2011-August/msg00000.html
http://forums.opensuse.org/english/other-forums/looking-something-other-than-support/461854-anyone-know-how-change-pangos-word-separators.html
The Pango blueprint proposal: http://www.levien.com/gnome/pango-0.1.html
The documentation for the function that defines word boundaries, which follow the "Word boundary as defined by UAX#29" rule: https://developer.gnome.org/pango/unstable/pango-Text-Processing.html#PangoLogAttr
The Unicode guidelines on word boundaries that Pango follows: http://www.unicode.org/reports/tr29/#Word_Boundaries

If my interpretation of the guidelines is correct, then it is expected that, in a word containing numbers, the numbers delimit the word in the same way that spaces and symbols do (with the exception of the apostrophe, ').

To report bugs against pango, just visit this link https://bugzilla.gnome.org/enter_bug.cgi?product=pango

Braiam

Posted 2013-09-25T17:54:13.183

Reputation: 4 526

Good research, thanks! Strange, though, that even the example given in the guidelines doesn't follow the default of having . as a boundary. In any case, it may be more of a feature request than a bug. Can you include a link to where one might file a bug? – Noyo – 2013-10-02T14:50:56.930

Thanks again! After some quick searching, it does look like people want this and have indeed filed bugs/feature requests: https://bugzilla.gnome.org/show_bug.cgi?id=111503 and https://bugzilla.gnome.org/show_bug.cgi?id=530427 and https://bugzilla.gnome.org/show_bug.cgi?id=97545 at least. Maybe I'll just follow/comment on one of those for now.

– Noyo – 2013-10-02T16:39:32.583

Actually, according to that last one, a fix for my specific "abc123xyz" example has been committed: https://bug97545.bugzilla-attachments.gnome.org/attachment.cgi?id=242980 . No idea, though, when that version will make it to my Linux distro.

– Noyo – 2013-10-02T16:52:11.347

It is merged, but it's much more recent than the link in your comment. See here: https://git.gnome.org/browse/pango/commit/pango/break.c?id=1aeb5c840e25a7d8538f701659d77dcd7b3a8444 . Not sure if that version is included in any stable version of a Linux distro at the moment.

– Noyo – 2013-10-02T17:34:23.133

I realized that, even though you thoroughly address the specific case of the word/number boundary annoyance I mentioned (indeed, the main problem I personally wanted fixed), the core of the question hasn't been completely answered. Can you edit your answer to additionally include references to the three issues I mentioned in the second comment and briefly assess the status of those? I'm not sure that what you say about PangoLogAttr() is correct, and issues related to the feature have indeed been filed. Your research definitely deserved the bounty though, thanks yet again! – Noyo – 2013-10-04T14:32:35.633

For Debian 7 (Wheezy):

You can download the source files from Debian and make the changes yourself, then recompile and install the created .deb packages:

Open a root terminal:

apt-get install dpkg-dev;
apt-get build-dep libpango1.0-0;
exit;

Open a regular terminal:

cd; mkdir patch-libpango; cd patch-libpango;
apt-get source libpango1.0-0;

Now go to your home folder and open the file patch-libpango/pango1.0-1.30.0/pango/break.c, then find this block of code:

/* ---- Word breaks ---- */

/* default to not a word start/end */
attrs[i].is_word_start = FALSE;
attrs[i].is_word_end = FALSE;

if (current_word_type != WordNone)
{
    /* Check for a word end */
    switch ((int) type)
    {
    case G_UNICODE_SPACING_MARK:
    case G_UNICODE_ENCLOSING_MARK:
    case G_UNICODE_NON_SPACING_MARK:
    case G_UNICODE_FORMAT:
        /* nothing, we just eat these up as part of the word */
        break;

    case G_UNICODE_LOWERCASE_LETTER:
    case G_UNICODE_MODIFIER_LETTER:
    case G_UNICODE_OTHER_LETTER:
    case G_UNICODE_TITLECASE_LETTER:
    case G_UNICODE_UPPERCASE_LETTER:
        if (current_word_type == WordLetters)
        {
            /* Japanese special cases for ending the word */
            if (JAPANESE (last_word_letter) ||
                    JAPANESE (wc))
            {
                if ((HIRAGANA (last_word_letter) &&
                            !HIRAGANA (wc)) ||
                        (KATAKANA (last_word_letter) &&
                            !(KATAKANA (wc) || HIRAGANA (wc))) ||
                        (KANJI (last_word_letter) &&
                            !(HIRAGANA (wc) || KANJI (wc))) ||
                        (JAPANESE (last_word_letter) &&
                            !JAPANESE (wc)) ||
                        (!JAPANESE (last_word_letter) &&
                            JAPANESE (wc)))
                attrs[i].is_word_end = TRUE;
            }
        }
        else
        {
            /* end the number word, start the letter word */
            attrs[i].is_word_end = TRUE;
            attrs[i].is_word_start = TRUE;
            current_word_type = WordLetters;
        }

        last_word_letter = wc;
        break;

    case G_UNICODE_DECIMAL_NUMBER:
    case G_UNICODE_LETTER_NUMBER:
    case G_UNICODE_OTHER_NUMBER:
        if (current_word_type != WordNumbers)
        {
            attrs[i].is_word_end = TRUE;
            attrs[i].is_word_start = TRUE;
            current_word_type = WordNumbers;
        }

        last_word_letter = wc;
        break;

    default:
        /* Punctuation, control/format chars, etc. all end a word. */
        attrs[i].is_word_end = TRUE;
        current_word_type = WordNone;
        break;
    }
}
else
{
    /* Check for a word start */
    switch ((int) type)
    {
    case G_UNICODE_LOWERCASE_LETTER:
    case G_UNICODE_MODIFIER_LETTER:
    case G_UNICODE_OTHER_LETTER:
    case G_UNICODE_TITLECASE_LETTER:
    case G_UNICODE_UPPERCASE_LETTER:
        current_word_type = WordLetters;
        last_word_letter = wc;
        attrs[i].is_word_start = TRUE;
        break;

    case G_UNICODE_DECIMAL_NUMBER:
    case G_UNICODE_LETTER_NUMBER:
    case G_UNICODE_OTHER_NUMBER:
        current_word_type = WordNumbers;
        last_word_letter = wc;
        attrs[i].is_word_start = TRUE;
        break;

    default:
        /* No word here */
        break;
    }
}

and replace it with this:

/* ---- Word breaks ---- */

/* default to not a word start/end */
attrs[i].is_word_start = FALSE;
attrs[i].is_word_end = FALSE;

if (current_word_type != WordNone)
{
    /* Check for a word end */
    switch ((int) type)
    {
    case G_UNICODE_SPACING_MARK:
    case G_UNICODE_ENCLOSING_MARK:
    case G_UNICODE_NON_SPACING_MARK:
    case G_UNICODE_FORMAT:
        /* nothing, we just eat these up as part of the word */
        break;

    case G_UNICODE_LOWERCASE_LETTER:
    case G_UNICODE_MODIFIER_LETTER:
    case G_UNICODE_OTHER_LETTER:
    case G_UNICODE_TITLECASE_LETTER:
    case G_UNICODE_UPPERCASE_LETTER:
        if (current_word_type == WordLetters)
        {
            /* Japanese special cases for ending the word */
            if (JAPANESE (last_word_letter) ||
                    JAPANESE (wc))
            {
                if ((HIRAGANA (last_word_letter) &&
                            !HIRAGANA (wc)) ||
                        (KATAKANA (last_word_letter) &&
                            !(KATAKANA (wc) || HIRAGANA (wc))) ||
                        (KANJI (last_word_letter) &&
                            !(HIRAGANA (wc) || KANJI (wc))) ||
                        (JAPANESE (last_word_letter) &&
                            !JAPANESE (wc)) ||
                        (!JAPANESE (last_word_letter) &&
                            JAPANESE (wc)))
                attrs[i].is_word_end = TRUE;
            }
        }

        last_word_letter = wc;
        break;

    case G_UNICODE_DECIMAL_NUMBER:
    case G_UNICODE_LETTER_NUMBER:
    case G_UNICODE_OTHER_NUMBER:

        last_word_letter = wc;
        break;

    default:
        if (wc == 0x005F) break; //underscore
        /* Punctuation, control/format chars, etc. all end a word. */
        attrs[i].is_word_end = TRUE;
        current_word_type = WordNone;
        break;
    }
}
else
{
    /* Check for a word start */
    switch ((int) type)
    {
    case G_UNICODE_LOWERCASE_LETTER:
    case G_UNICODE_MODIFIER_LETTER:
    case G_UNICODE_OTHER_LETTER:
    case G_UNICODE_TITLECASE_LETTER:
    case G_UNICODE_UPPERCASE_LETTER:
        current_word_type = WordLetters;
        last_word_letter = wc;
        attrs[i].is_word_start = TRUE;
        break;

    case G_UNICODE_DECIMAL_NUMBER:
    case G_UNICODE_LETTER_NUMBER:
    case G_UNICODE_OTHER_NUMBER:
        current_word_type = WordNumbers;
        last_word_letter = wc;
        attrs[i].is_word_start = TRUE;
        break;

    default:
        /* No word here */
        break;
    }
}

Go back to your regular terminal:

cd ~/patch-libpango/pango*;
dpkg-buildpackage -rfakeroot -uc -b;

Now go to your home folder and open the folder patch-libpango. You should find some .deb files there. Install them all except for the debug and doc packages (the ones that have -dbg and -doc in their filenames).

You can now delete the patch-libpango directory. Go back to your regular terminal:

cd; rm -rf patch-libpango;

Done. You don't need to restart your system, but applications that are already running must be restarted before they load the new library.

Note: this will also treat the underscore as part of a word (find 0x005F in the edited code).
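To get a feel for what the edit changes without rebuilding anything, here is a minimal standalone sketch of the two behaviours. It is not Pango code: the names WordType and word_length are made up for illustration, it only handles ASCII, and it ignores the Japanese special cases and combining marks that the real break.c deals with. It assumes start is not past the end of the string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical names: these mirror the idea of Pango's internal
 * WordNone/WordLetters/WordNumbers states, but are not Pango's types. */
typedef enum { WORD_NONE, WORD_LETTERS, WORD_NUMBERS } WordType;

/*
 * Return the length of the word starting at text[start], roughly what a
 * double-click on that character would select. ASCII only.
 *
 * patched == 0: a letter<->digit transition ends the word (stock logic).
 * patched != 0: digits and '_' continue the word (the edited block).
 *
 * The caller must ensure start <= strlen(text).
 */
size_t word_length(const char *text, size_t start, int patched)
{
    WordType current = WORD_NONE;
    size_t i;

    assert(text != NULL);

    for (i = start; text[i] != '\0'; i++) {
        unsigned char c = (unsigned char) text[i];
        WordType type = isalpha(c) ? WORD_LETTERS
                      : isdigit(c) ? WORD_NUMBERS
                      : WORD_NONE;

        if (current == WORD_NONE) {
            if (type == WORD_NONE)
                return 0;          /* no word starts here */
            current = type;        /* word start */
        } else if (patched) {
            /* only punctuation other than '_' ends the word */
            if (type == WORD_NONE && c != '_')
                break;
        } else {
            /* any change of class (letter, digit, other) ends the word */
            if (type != current)
                break;
        }
    }
    return i - start;
}
```

With this sketch, word_length("abc123xyz", 0, 0) gives 3, which matches the three-character selection described in the question, while word_length("abc123xyz", 0, 1) gives 9. As in the edited block, an underscore can continue a word but cannot start one.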


Brahim Raddahi

Posted 2013-09-25T17:54:13.183

Reputation: 1