I work on password research, and I have extensively reviewed the zxcvbn-related papers as well as those of other researchers in the field.
Yes, the answer referenced by @Arminius is correct in summarizing why this approach aims to improve security without compromising memorability. Please take a look at it.
The core of your question is whether it would be easy to build a dictionary that cracks passwords made up of 4 different dictionary words.
It is true that the main argument of the above approaches addresses only the brute-force threat model: the entropy is calculated without any consideration for an attacker who has access to the initial dictionary.
Their approach seems to suggest that all that matters is the password "strength", with no regard for how actual attacks are carried out.
Bruce Schneier made an interesting summary of previous commentaries on this.
It becomes more so if, as your premise goes, "everyone is using that scheme" and the attacker knows that.
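
To make the gap between the two threat models concrete, here is a rough sketch in Python. The numbers are assumptions chosen only for illustration: a 2048-word list, words drawn independently and uniformly at random, and a passphrase of about 25 lowercase letters. The function names are mine and do not come from zxcvbn or any other tool.

```python
import math


def bruteforce_bits(length: int, alphabet_size: int = 26) -> float:
    """Search space in bits when the attacker guesses character by character."""
    return length * math.log2(alphabet_size)


def dictionary_bits(num_words: int, dictionary_size: int) -> float:
    """Search space in bits when the attacker knows the word list and the scheme.

    Assumes each word is drawn independently and uniformly from the list.
    """
    return num_words * math.log2(dictionary_size)


# Assumed example values, not measurements:
# 4 words, roughly 25 lowercase letters in total, 2048-word dictionary.
brute = bruteforce_bits(25)
dict_aware = dictionary_bits(4, 2048)

print(f"brute-force model:      {brute:.1f} bits")
print(f"dictionary-aware model: {dict_aware:.1f} bits")
```

Under these assumptions the same passphrase looks far stronger in the brute-force model than it does to an attacker who knows the dictionary and the scheme. That second figure is the one that matters if everyone uses the scheme.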