How long is a Welsh word?

37

1

Write a program or function which receives as input a string representing a Welsh word (UTF-8 unless otherwise specified by you).

The following are all single letters in Welsh:

a, b, c, ch, d, dd, e, f, ff, g, ng, h, i, j, l, ll, m, n, o, p, ph, r, rh, s, t, th, u, w, y

To quote Wikipedia,

While the digraphs ch, dd, ff, ng, ll, ph, rh, th are each written with two symbols, they are all considered to be single letters. This means, for example that Llanelli (a town in South Wales) is considered to have only six letters in Welsh, compared to eight letters in English.

These letters also exist in Welsh, though they are restricted to technical vocabulary borrowed from other languages:

k, q, v, x, z

Letters with diacritics are not regarded as separate letters, but your function must accept them and be able to count them. Possible such letters are:

â, ê, î, ô, û, ŷ, ŵ, á, é, í, ó, ú, ý, ẃ, ä, ë, ï, ö, ü, ÿ, ẅ, à, è, ì, ò, ù, ẁ

(This means that ASCII is not an acceptable input encoding, as it cannot encode these characters.)
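
One practical wrinkle with the accented letters: the challenge does not say which Unicode normalization form the input uses. The sketch below assumes the input might arrive decomposed (base letter plus combining mark) and composes it first, so each accented letter is a single code point before counting. The helper name `to_nfc` is made up for illustration.

```python
import unicodedata

def to_nfc(word: str) -> str:
    """Compose base letter + combining mark into one code point where possible."""
    return unicodedata.normalize("NFC", word)

# 'y' followed by COMBINING CIRCUMFLEX ACCENT (U+0302): a decomposed "Tŷr".
decomposed = "Ty\u0302r"

print(len(decomposed))          # 4 code points before composing
print(len(to_nfc(decomposed)))  # 3 code points after composing
```

If the input is already precomposed, as UTF-8 text usually is, the call is a no-op.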

Notes:

  • This is code golf.
  • You do not have to account for words like llongyfarch, in which the ng is not a digraph, but two separate letters. This word has nine letters, but you can miscount it as eight. (If you can account for such words, that's kind of awesome, but outside the scope of this challenge.)
  • The input is guaranteed to have no whitespace (unless you prefer it with a single trailing newline (or something more esoteric), in which case that can be provided). There will certainly be no internal whitespace.

Test cases:

  • Llandudno, 8
  • Llanelli, 6
  • Rhyl, 3
  • Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch, 50 (really 51, but we'll count 50)
  • Tŷr, 3
  • Cymru, 5
  • Glyndŵr, 7
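
For reference, here is an ungolfed sketch of the counting rule in Python, checked against the test cases above. It is only one reading of the spec: a left-to-right scan that consumes two characters whenever they form one of the eight digraphs. The names `DIGRAPHS` and `welsh_length` are illustrative, not part of the challenge.

```python
DIGRAPHS = {"ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"}

def welsh_length(word: str) -> int:
    """Count Welsh letters, treating each digraph as a single letter."""
    lowered = word.strip().lower()  # tolerate an optional trailing newline
    count = 0
    i = 0
    while i < len(lowered):
        # Consume two characters when they form a digraph, otherwise one.
        i += 2 if lowered[i:i + 2] in DIGRAPHS else 1
        count += 1
    return count

CASES = {
    "Llandudno": 8,
    "Llanelli": 6,
    "Rhyl": 3,
    "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch": 50,
    "T\u0177r": 3,       # Tŷr
    "Cymru": 5,
    "Glynd\u0175r": 7,   # Glyndŵr
}

for word, expected in CASES.items():
    assert welsh_length(word) == expected, word
```

As the notes allow, this miscounts llongyfarch as eight, because it has no way of knowing that the ng there is two letters.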

TRiG

Posted 2016-09-12T17:01:20.077

Reputation: 609

4Can the input be given in all lowercase? – ETHproductions – 2016-09-12T17:11:11.467

15My wife who is a native Welsh speaker would recommend that the J is added into the "Borrowed" letters section as it isn't actually part of the Welsh alphabet – Rich Starkie – 2016-09-12T20:29:06.707

@RichStarkie The Wikipedia article was a little vague on that front. My understanding is that j is used in borrowed words even when it's not present in the original word, so it's used phonologically, which implies that at this stage it's naturalized into the language. I've seen similar arguments about v in Irish. It's widely considered not to be part of the Irish alphabet, but it exists in some Irish names, such as Ó Cuiv. – TRiG – 2016-09-12T21:06:58.603

What is it with "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch" that makes it 51? – Erik the Outgolfer – 2016-09-13T13:02:22.723

@EriktheGolfer I think it's an ng which crosses a morpheme boundary, making it two separate letters, not a digraph. – TRiG – 2016-09-13T13:09:01.590

@TRiG yngy is four letters? – Erik the Outgolfer – 2016-09-13T13:10:27.887

@EriktheGolfer. Probably, but for the purpose of this question we'll call it three. – TRiG – 2016-09-13T13:25:16.457

I seem to recall from my Welsh lessons that nh and ngh are single letters, too. As in "fy nhadau" and "yng Nghymru". – megaflop – 2016-09-13T15:06:41.787

@daiscog. I'm just relying on Wikipedia's article on Welsh orthography. That said, https://en.wikipedia.org/wiki/Nh_(digraph)#Welsh does list nh as a Welsh digraph, even if the other article doesn't. Interesting. Too late to change the question at this stage, though. – TRiG – 2016-09-13T15:10:43.027

1And a footnote in the Welsh orthography article lists mh, nh, and ngh as graphemes. Methinks I need to open a question on Linguistics SE. – TRiG – 2016-09-13T15:13:19.830

3Shame it's too late; that triple-glyphed "ngh" might have made it a little more complicated. – megaflop – 2016-09-13T15:15:59.127
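
Out of curiosity, here is roughly what the variant discussed in these comments could look like. This is outside the challenge as posted and rests on an assumption: that mh, nh and ngh are each counted as one unit. The only real subtlety is ordering, since regex alternation takes the first alternative that matches, so ngh has to be tried before ng. The names below are made up for illustration.

```python
import re

# Hypothetical extension, not part of the challenge: "ngh" is listed first so
# the three-character form wins over "ng"; [cmnprt]h adds "mh" and "nh".
EXTENDED_LETTER = re.compile(r"ngh|[cmnprt]h|dd|ff|ng|ll|.", re.IGNORECASE)

def welsh_length_extended(word: str) -> int:
    """Count letters under the assumed extended digraph/trigraph set."""
    return len(EXTENDED_LETTER.findall(word))

print(welsh_length_extended("Nghymru"))  # 5: Ngh, y, m, r, u
print(welsh_length_extended("nhadau"))   # 5: nh, a, d, a, u
```

None of the challenge's own test cases contain mh, nh or ngh, so their counts are unchanged under this variant.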

@Rich Starkie, why does your wife not leave her own comment? Also you don't happen to be called Ringo do you? – Octopus – 2016-09-13T18:09:01.620

Answers

6

05AB1E, 24 23 21 bytes

Code:

u•éÓœ°D¥M™ù>•30B2ô0:g

Explanation:

u                      # Convert the input to uppercase.
 •éÓœ°D¥M™ù>•30B       # Compressed version of CHDDFFNGLLPHRHTH.
                         It converts the text between the •'s from base 214 to
                         base 10 and converts that to base 30.
                2ô     # Split into pieces of 2.
                  0:   # Replace each element that also occurs in the input by 0.
                    g  # Get the length of the processed input.

Uses the CP-1252 encoding. Try it online!
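
For readers who don't speak 05AB1E, the explanation above corresponds roughly to the following Python. This is a loose rendering of the described steps, not a byte-for-byte translation: the compressed number is replaced by the plain string it decodes to, and the function name is invented.

```python
def welsh_length_by_replacement(word: str) -> int:
    """Uppercase, replace every digraph with a single '0', take the length."""
    text = word.upper()
    digraphs = "CHDDFFNGLLPHRHTH"
    # Split into pieces of 2: ['CH', 'DD', 'FF', 'NG', 'LL', 'PH', 'RH', 'TH'].
    pairs = [digraphs[i:i + 2] for i in range(0, len(digraphs), 2)]
    for pair in pairs:
        text = text.replace(pair, "0")
    return len(text)

print(welsh_length_by_replacement("Llanelli"))  # 6
```

Replacing with a digit is safe here because a "0" can never combine with a neighbouring letter to form a new digraph.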

Adnan

Posted 2016-09-12T17:01:20.077

Reputation: 41 965

16

Retina, 23 bytes

i`[cprt]h|dd|ff|ng|ll|.

Try it online!

Even moar regex.

user48538

Posted 2016-09-12T17:01:20.077

Reputation: 1 478

It's probably my ignorance of Retina, but where is the outputting of the length of the input text? The documentation on Retina doesn't seem to explain how that's working in the "Try it online!" site. – Xaero Degreaz – 2016-09-13T14:38:53.400

2The output is implicit, because the only line is a Match stage, returning the number of matches. Here, the regex matches every Welsh letter. – user48538 – 2016-09-13T15:17:51.747

So by that logic, then every answer below where the length is explicitly called in the code can be shortened? – Xaero Degreaz – 2016-09-13T15:29:09.217

2@XaeroDegreaz Retina is one of the only languages that automatically counts matches and prints them out. This is how Retina, the language, works. It is not how other languages work, and so those languages need to call their length functions explicitly to get the right output. – isaacg – 2016-09-13T15:42:25.637

Thanks, I understand now. After reading more into the documentation I see the default "Match" stage performs this output. – Xaero Degreaz – 2016-09-13T16:02:14.380
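
To make the point from this thread concrete: in a language without an implicit match-counting stage, the same regex has to be paired with an explicit count. A small Python sketch, with illustrative names, that also shows what the individual matches are:

```python
import re

# Digraphs are listed before the catch-all "." so they are tried first.
LETTER = re.compile(r"[cprt]h|dd|ff|ng|ll|.", re.IGNORECASE)

def welsh_letters(word: str) -> list[str]:
    """Return the word split into Welsh letters, digraphs kept together."""
    return LETTER.findall(word)

letters = welsh_letters("Llanelli")
print(letters)       # ['Ll', 'a', 'n', 'e', 'll', 'i']
print(len(letters))  # 6
```

Every character is consumed by exactly one match, so the number of matches is the number of letters.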

5

JavaScript (ES6), 44 bytes

x=>x.match(/[cprt]h|dd|ff|ng|ll|./gi).length

The trivial answer may be the shortest.

ETHproductions

Posted 2016-09-12T17:01:20.077

Reputation: 47 880

5

Bash (sed + wc), 52 50 41 bytes

-9 thanks to Jordan

sed -r 's,dd|ff|ng|ll|[cprt]h,1,gi'|wc -m

If uppercase letters are required, this needs an i at the end of the sed command. (I left it out because all of the "single letters" in the question are lowercase even though some examples aren't).

Riley

Posted 2016-09-12T17:01:20.077

Reputation: 11 345

1Why grep -o .|wc -l instead of wc -c? – Jordan – 2016-09-13T05:04:16.437

wc -c counts â through ẁ as two. – Riley – 2016-09-13T12:27:16.510

Ah, of course. FWIW if you use GNU or BSD wc you can use -m to count characters instead of bytes. – Jordan – 2016-09-13T12:56:42.347

Can you move the c from ch in with the [prt]? sed -r 's,dd|ff|ng|ll|[cprt]h,1,gi'|wc -m – megaflop – 2016-09-13T15:15:01.670

@daiscog I thought I did that already... That's how I went from 52 to 50. It must have reverted somehow. Thanks – Riley – 2016-09-13T15:19:19.143

2It's a shame ([dfl])\1 would be longer than dd|ff|ll. Just one more doubled-consonant would favour the clever version. – Toby Speight – 2016-11-11T14:51:26.790
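
The wc -c versus wc -m distinction in this thread is bytes versus characters. A quick Python illustration, with made-up helper names. Note that the accented w forms with acute, grave and diaeresis sit above U+07FF, so in UTF-8 they take three bytes rather than two.

```python
def utf8_byte_count(word: str) -> int:
    """What a byte counter such as wc -c sees for UTF-8 input."""
    return len(word.encode("utf-8"))

def code_point_count(word: str) -> int:
    """What a character counter such as wc -m sees in a UTF-8 locale."""
    return len(word)

word = "Glynd\u0175r"  # Glyndŵr

print(utf8_byte_count(word))      # 8: ŵ (U+0175) takes two bytes
print(code_point_count(word))     # 7
print(utf8_byte_count("\u1e83"))  # 3: ẃ (U+1E83) takes three bytes
```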

4

Straw, 30 58 35 33 bytes

<((?i:[cprt]h|dd|ff|ng|ll|.))0/$>

Replace each occurrence of the regex by 0, and convert from unary to decimal.

Sadly, Straw can't pass flags to regexes. I forgot about the ?flags: construct

Try it online! (The added code is to verify all test cases)

TuxCrafting

Posted 2016-09-12T17:01:20.077

Reputation: 4 547

How does this language differ from something like Retina? – Downgoat – 2016-09-12T21:33:00.627

@Downgoat Straw is stack-based :P – TuxCrafting – 2016-09-12T21:34:41.653

3

Python 3, 64 bytes

import re
print(len(re.findall("[cprt]h|dd|ff|ng|ll|.",input())))

Uses regex again

Ideone it!

Beta Decay

Posted 2016-09-12T17:01:20.077

Reputation: 21 478

3

PowerShell v2+, 52 50 48 bytes

($args[0]-replace'dd|ff|ng|ll|[prtc]h',0).length

Does a -replace on all the two-symbol-single-letter letters, changes 'em to 0 (done because changing to a non-numeral would require quotes), then gets the .length of the resultant string.

Test cases

PS C:\Tools\Scripts\golfing> 'Llandudno','Llanelli','Rhyl','Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch','Tŷr','Cymru','Glyndŵr'|%{"$_ --> "+(.\how-long-is-a-welsh-word.ps1 $_)}
Llandudno --> 8
Llanelli --> 6
Rhyl --> 3
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch --> 50
Tŷr --> 3
Cymru --> 5
Glyndŵr --> 7

AdmBorkBork

Posted 2016-09-12T17:01:20.077

Reputation: 41 581

I'm not familiar with PowerShell, but do you really need the parentheses around [prtc]h? – Jordan – 2016-09-12T17:38:53.457

@Jordan No, I do not. That's not a PowerShell thing, that's a me-not-good-at-regex thing. :D Thanks for the golf! – AdmBorkBork – 2016-09-12T18:03:36.207

2

V, 31 bytes

Íã[cprt]hüddüffüngüllü./
Dé0@"

Try it online, or Verify all test cases!

This contains some unprintable characters, so here is a hexdump:

0000000: cde3 5b63 7072 745d 68fc 6464 fc66 66fc  ..[cprt]h.dd.ff.
0000010: 6e67 fc6c 6cfc 2e2f 010a 44e9 3040 22    ng.ll../..D.0@"

James

Posted 2016-09-12T17:01:20.077

Reputation: 54 537

2

PHP, 56 bytes

<?=preg_match_all("#[cprt]h|dd|ff|ll|ng|.#iu",$argv[1]);

Jörg Hülsermann

Posted 2016-09-12T17:01:20.077

Reputation: 13 026

1I believe [dfl]{2} matches df, ld, etc. as well as its intended matches. dd|ff|ll is the same length. – ETHproductions – 2016-09-12T20:14:02.623

1I know that your belief is true, but I think that your belief is not a type of belief. It looks more like a type of knowledge – Jörg Hülsermann – 2016-09-12T20:34:10.663

1Instead of echo(space at the end), use <?=, which saves 2 bytes. Also, the $t isn't necessary there, saving you 3 more bytes. – Ismael Miguel – 2016-09-13T10:57:30.557

Thank you, Ismael. I must have been more than a little confused not to remove the $t – Jörg Hülsermann – 2016-09-13T11:14:16.193
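
The [dfl]{2} point from this thread is easy to check. The sketch below compares it with the backreference form ([dfl])\1 mentioned under the sed answer, using Python's re module purely as a test bed; the variable names are illustrative.

```python
import re

loose = re.compile(r"[dfl]{2}")    # any two of d, f, l, in any combination
strict = re.compile(r"([dfl])\1")  # one of d, f, l, then the same letter again

print(bool(loose.fullmatch("ld")))   # True: not a Welsh digraph, but it matches
print(bool(strict.fullmatch("ld")))  # False
print(bool(strict.fullmatch("ll")))  # True

# Pattern lengths, which is what matters for golfing.
print(len(r"[dfl]{2}"), len(r"([dfl])\1"), len("dd|ff|ll"))  # 8 9 8
```

So the backreference version is correct but one character longer than spelling the three doubled letters out.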

2

Java 7, 156 73 bytes

Loads of bytes saved thanks to @OlivierGrégoire.

int c(String s){return s.replaceAll("[cprt]h|dd|ff|ng|ll","*").length();}

Ungolfed & test cases:

Try it here.

class M{
  static int c(String s){
    return s.replaceAll("[cprt]h|dd|ff|ng|ll", "*").length();
  }

  public static void main(String[] a){
    System.out.println(c("llandudno"));
    System.out.println(c("llanelli"));
    System.out.println(c("rhyl"));
    System.out.println(c("llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch"));
    System.out.println(c("tŷr"));
    System.out.println(c("cymru"));
    System.out.println(c("glyndŵr"));
  }
}

Output:

8
6
3
50
3
5
7

Kevin Cruijssen

Posted 2016-09-12T17:01:20.077

Reputation: 67 575

You import and then you don't use Matcher directly? :o Also, Matcher can be defined in the for loop. – Olivier Grégoire – 2016-09-13T13:03:35.917

1I have the strong feeling that return s.replaceAll("[cprt]h|dd|ff|ng|ll","a").length() is way, way shorter. Can't this work? – Olivier Grégoire – 2016-09-13T13:07:56.890

Well, yes, it works, and it's 73 bytes for the Java 7 version (int c(String s){return s.replaceAll("[cprt]h|dd|ff|ng|ll","a").length();}). And only 51 for the Java 8 version (s->s.replaceAll("[cprt]h|dd|ff|ng|ll","a").length()). – Olivier Grégoire – 2016-09-13T13:17:14.003

1@OlivierGrégoire Thanks. The Matcher was an accident. I had it correctly in the test code, but not in the golfed code.. >.> Your replaceAll works better though, thanks. – Kevin Cruijssen – 2016-09-13T13:27:48.600

1

Perl 6, 36 bytes

+*.comb(/:i.|<[cprt]>h|dd|ff|ng|ll/)

Try it online!

bb94

Posted 2016-09-12T17:01:20.077

Reputation: 1 831

1

R, 54 bytes

Very similar to the other answers. Matches any of the two character letters and replaces them with @ and subsequently counts the number of characters. Reads input from stdin. Uses the option ignore.case = TRUE (fourth argument to gsub) to match both upper and lowercase characters.

nchar(gsub("ch|dd|ff|ng|ll|ph|rh|th","@",scan(,""),T))

Bonus

Both gsub and nchar are vectorized which means that this also works on a character vector, e.g.:

v=c("Llandudno","Llanelli","Rhyl","Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch","Tŷr","Cymru","Glyndŵr")
nchar(gsub("ch|dd|ff|ng|ll|ph|rh|th","@",v,T))

produces:

[1]  8  6  3 50  3  5  7

Billywob

Posted 2016-09-12T17:01:20.077

Reputation: 3 363

0

tcl, 71

proc L s {string le [regsub -all -nocase ch|dd|ff|ng|ll|ph|rh|th $s @]}

demo

sergiol

Posted 2016-09-12T17:01:20.077

Reputation: 3 055

0

Perl 5, 35 + 1 (-p) = 36 bytes

s/[cprt]h|dd|ff|ng|ll/a/gi;$_=y///c

Try it online!

Xcali

Posted 2016-09-12T17:01:20.077

Reputation: 7 671

0

XQuery, 77 bytes

declare variable$s external;count(tokenize($s,'[cprt]h|ff|dd|ll|ng|.','i'))-1
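
The -1 at the end is presumably there because the regex is used as a separator: when every character of the word belongs to some separator match, splitting on n matches leaves n+1 empty pieces. A Python sketch of the same idea, assuming re.split behaves like XQuery's tokenize in this respect; the function name is invented.

```python
import re

def welsh_length_by_split(word: str) -> int:
    """Split on every Welsh letter; n separators leave n + 1 (empty) pieces."""
    pieces = re.split(r"[cprt]h|ff|dd|ll|ng|.", word, flags=re.IGNORECASE)
    return len(pieces) - 1

print(re.split(r"[cprt]h|ff|dd|ll|ng|.", "Rhyl", flags=re.IGNORECASE))  # ['', '', '', '']
print(welsh_length_by_split("Rhyl"))  # 3
```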

Kniffler

Posted 2016-09-12T17:01:20.077

Reputation: 41